feat: stage AMD SEV-SNP attestation support #703
|
SEV-SNP TCB/advisory policy slice is pushed. What changed:
Still fail-closed:
Validation:
|
|
Continued with the next quality-gate slice and pushed a small clippy cleanup commit. Commit:
What changed:
Validation now passing:
Independent review of the cleanup diff found no behavior/security regressions. |
|
Milestone 1 is done: PR #703 is now review-ready staging for AMD SEV-SNP, still without production key release. New commit:
What changed:
Validation passed after doc/proof refresh:
I am marking the PR ready for review now. Milestone 2 remains separate: production SNP key release policy + revocation/advisory collateral + guarded release enablement. |
|
Milestone 2 is now implemented and pushed. Commit: What changed:
Validation passed:
cargo fmt --all
cargo test -p dstack-kms --all-features
cargo test -p dstack-attest --all-features
cargo test -p dstack-vmm --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run
Independent security review: no release-gate blockers found after the self-authorization startup-safety fix. |
SNP E2E smoke follow-up
I kept going on the manual SNP smoke on
What the smoke found/fixed:
Smoke status:
Validation passed after the fixes:
cargo fmt --all
cargo test -p dstack-vmm --all-features
cargo test -p dstack-kms --all-features
cargo test -p dstack-attest --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run |
AMD SEV-SNP manual E2E smoke update
I pushed a follow-up commit that completes the dstack-managed SNP smoke path:
What the smoke proved
Failure gate also exercised
The lab host reports verifier-derived
Then, with an explicit lab-only allowlist (
Fixes included
Validation run
All passed locally:
cargo fmt --all
cargo test -p dstack-attest --all-features
cargo test -p dstack-util --all-features
cargo test -p dstack-kms --all-features
cargo test -p dstack-vmm --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run
No secret/key material was included in logs or this comment. |
|
Thank you so much for this — it's a huge and impressively thorough piece of work. 🙏 I have some other things on my plate right now, but I'll review this once I'm through them. Thanks again! |
|
Fresh-box SNP smoke update (sanitized):
Validation after doc/script update: Commit: No secrets or credential material included in this update. |
|
Fresh-box SNP smoke follow-up pushed in What changed:
Remote smoke evidence from
So the remaining gap is not guest boot, VMM wiring, KMS startup, or release-policy plumbing. The current blocker is external AMD KDS collateral availability/rate-limiting for the app quote. Validation passed after this update: No secret material is included in the docs or PR evidence. |
|
Update: completed the fresh-box SNP smoke through both gates and pushed the follow-up fix. What changed in the latest commit (
Remote smoke result on Validation rerun after the final patch:
No secrets or key material were included in the smoke artifacts or PR note. |
85ace8b to b9d968d
|
Published the AMD SEV-SNP support design and production-readiness tracker here: #713 This separates the broader design discussion from the PR implementation review. The issue captures:
|
|
Applied the Chipotle-agent feedback to the PR docs/body without pulling Chipotle-specific app config into dstack scope. Updates pushed in
Validated before push:
bash -n test-scripts/snp-e2e-smoke.sh
cargo fmt --all --check
git diff --check
Chipotle-specific note deliberately left out of dstack implementation/docs except as out-of-scope in the PR body: the app needed |
Summary
This PR stages AMD SEV-SNP as a first-class dstack attestation platform alongside existing TDX/Nitro/GCP paths, and includes a controlled/fail-closed SNP key/cert release path.
At a high level, this branch:
/dev/sev-guest extended-report ioctl fallback.
DSTACK_AMD_KDS_PROXY_URL / KMS sev_snp.amd_kds_proxy_url for lab hosts that hit AMD KDS throttling.
app_id launch-measured for SNP by binding app identity into the measured kernel cmdline.
BootInfo from verified evidence: measurement, chip id, app id, compose hash, rootfs hash, TCB status, and advisory ids.
test-scripts/snp-e2e-smoke.sh as a reusable manual hardware smoke script.
Default security posture
SNP release remains fail-closed by default.
Defaults:
Sensitive release surfaces guarded by this gate:
GetAppKey
GetKmsKey
SignCert
GetTempCaCert
Additional safety: KMS startup rejects SNP release enablement unless enforce_self_authorization = true, so the self-authorized temp-CA path cannot silently bypass SNP release policy.
Even when local release is enabled, external auth must still allow the verified SNP BootInfo.
AMD KDS collateral proxy support
The lab SNP host hit direct AMD KDS HTTP 429 while fetching VCEK/cert-chain collateral. This PR preserves fail-closed verification and adds an explicit proxy/cache path instead of bypassing cert verification.
Important details:
dstack-attest respects DSTACK_AMD_KDS_PROXY_URL for AMD KDS cert-chain and VCEK fetches.
core.sev_snp.amd_kds_proxy_url.
kms/src/main.rs exports the configured proxy before attestation verification.
ra-rpc::QuoteVerifier carries/re-applies the proxy around per-request quote verification.
dstack.amd_kds_proxy_url=... in the kernel cmdline; basefiles/dstack-prepare.sh exports DSTACK_AMD_KDS_PROXY_URL and writes /run/dstack/environment; basefiles/dstack-guest-agent.service loads that file via EnvironmentFile=-/run/dstack/environment.
The Lit proxy shape used in the smoke is path-prefix passthrough:
not a ?url= wrapper.
Hardware smoke proof
Manual hardware smoke was rerun on the SNP host:
Latest sanitized result:
Lab success used:
Production defaults still deny OutOfDate TCB and keep allowed_advisory_ids = [].
Image requirement
The working guest image was a coherent meta-dstack image built with:
Do not use the default TDX image for SNP smoke. A coherent PR image built with the default tdx machine produced a 6.18.24-dstack kernel with # CONFIG_AMD_MEM_ENCRYPT is not set; controlled QEMU tests showed that kernel resets immediately after OVMF loads kernel/initrd. SNP-capable kernels booted the same QEMU/OVMF path to Linux/SNP markers.
Also do not rely on ad-hoc dstack-util injection into a stock image. That changed measurement/boot behavior and regressed the boundary. For full app-key success, use a coherent meta-dstack image whose kernel/modules/initramfs/rootfs/verity metadata and guest userspace include the same PR branch.
Quote / attestation proof
Earlier guest quote proof confirmed the SNP guest can produce a hardware report containing the expected challenge bytes:
The final KMS smoke additionally proves that the app guest's SNP evidence verifies through KMS/auth successfully enough to exercise strict denial and lab success release gates.
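As a rough illustration of the challenge check described above, the sketch below compares the 64-byte REPORT_DATA field of an SNP report against an expected challenge. It is not taken from this PR: the helper names are hypothetical, and zero-padding a short challenge to 64 bytes is an assumption, since the comment does not say how dstack lays the challenge out inside REPORT_DATA. A check like this is only meaningful after the report signature and certificate chain have been verified.

```rust
/// Size in bytes of the REPORT_DATA field in an SNP attestation report.
const REPORT_DATA_LEN: usize = 64;

/// Hypothetical helper: zero-pad a challenge to REPORT_DATA size.
/// Returns None when the challenge does not fit.
/// (The padding scheme is an assumption, not dstack's documented layout.)
fn expected_report_data(challenge: &[u8]) -> Option<[u8; REPORT_DATA_LEN]> {
    if challenge.len() > REPORT_DATA_LEN {
        return None;
    }
    let mut out = [0u8; REPORT_DATA_LEN];
    out[..challenge.len()].copy_from_slice(challenge);
    Some(out)
}

/// Hypothetical helper: true only if REPORT_DATA equals the padded challenge.
fn report_binds_challenge(report_data: &[u8; REPORT_DATA_LEN], challenge: &[u8]) -> bool {
    match expected_report_data(challenge) {
        // OR together the XOR of every byte pair so the comparison
        // always walks the full field instead of exiting early.
        Some(expected) => {
            report_data
                .iter()
                .zip(expected.iter())
                .fold(0u8, |acc, (a, b)| acc | (a ^ b))
                == 0
        }
        // Oversized challenges can never match: fail closed.
        None => false,
    }
}
```

A caller would pass the REPORT_DATA bytes extracted from a verified report together with the nonce it issued, and treat a `false` result as a failed attestation.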
Measurement proof
A live golden-vector test on an SNP-capable host cross-checks dstack's pure Rust SNP measurement recomputation against sev-snp-measure:
cargo test -p dstack-kms --all-features recomputation_matches_sev_snp_measure_live_golden_vector -- --ignored --nocapture
Latest recorded proof:
See docs/amd-sev-snp-review-readiness.md for the fuller proof block and review boundary.
Important implementation notes
Key fixes discovered during E2E smoke:
.sys-config.json now includes sev_snp_measurement so KMS can recompute the same SNP launch measurement used by QEMU.
rootfs_hash only in kernel cmdline (dstack.rootfs_hash=...), so VMM/KMS now preserve and use that path.
docker_compose_hash, rootfs_hash, and app_id.
EPYC-v4 and confidential virtio PCI options (disable-legacy=on, iommu_platform=true).
chip_id + reported TCB when local evidence lacks cert collateral.
mr_config_id checks while preserving non-SNP behavior.
dstack-prepare.sh handles SNP guest detection, early chronyc unavailability, minimal smoke DNS fallback, and AMD KDS proxy propagation.
DSTACK_SNP_SMOKE_KDS_PROXY_URL, configurable VMM ports/URL, port cleanup via fuser, better strict-TCB denial detection, and clearer KDS-blocked vs policy-denied logs.
Validation run
All passed locally on the final branch head:
Known limitations / follow-ups
platform = "auto" remains conservative while SNP is experimental. Operators must explicitly set platform = "amd-sev-snp".
advisory_ids is currently explicit and empty. Future advisory/revocation collateral should populate it and will be denied unless explicitly allowlisted.
test-scripts/snp-e2e-smoke.sh.
meta-dstack SNP guest image.
tcbStatus = "OutOfDate"; success required an explicit lab allowlist. Production defaults still deny this.
Human review focus
Please pay special attention to:
Fail-closed release semantics
UpToDate only by default.
Measurement / identity binding
app_id, compose hash, rootfs hash, kernel/initrd/cmdline, OVMF, vCPU model, guest features, and optional smoke proxy cmdline are all part of recomputation or policy input.
app_id is launch-measured, not just auth metadata.
AMD KDS collateral fallback/proxy
Non-SNP regression risk
DstackAmdSevSnp.
Operational policy choice
UpToDate TCB in production should remain an explicit operator decision, not a default.
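For reviewers who want a compact mental model of the fail-closed semantics listed above, here is a minimal sketch, not the PR's actual code. The type names, the `release_enabled` and `allowed_tcb_statuses` fields, and both function names are hypothetical. Only `enforce_self_authorization`, `allowed_advisory_ids`, `advisory_ids`, and the `UpToDate`/`OutOfDate` status values come from the description. The sketch covers only the local gate; external auth must still allow the verified BootInfo separately.

```rust
use std::collections::BTreeSet;

/// TCB status as derived by the verifier (values named in this PR).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TcbStatus {
    UpToDate,
    OutOfDate,
}

/// Hypothetical policy shape; field names other than
/// `allowed_advisory_ids` are illustrative assumptions.
#[derive(Debug, Clone)]
struct SnpReleasePolicy {
    release_enabled: bool,
    allowed_tcb_statuses: Vec<TcbStatus>,
    allowed_advisory_ids: BTreeSet<String>,
}

impl Default for SnpReleasePolicy {
    /// Fail-closed defaults: release off, UpToDate only, no advisories allowed.
    fn default() -> Self {
        Self {
            release_enabled: false,
            allowed_tcb_statuses: vec![TcbStatus::UpToDate],
            allowed_advisory_ids: BTreeSet::new(),
        }
    }
}

/// Hypothetical subset of the verified evidence used by the gate.
struct VerifiedEvidence {
    tcb_status: TcbStatus,
    advisory_ids: Vec<String>,
}

/// Startup check: enabling SNP release without self-authorization
/// enforcement is rejected.
fn check_startup(policy: &SnpReleasePolicy, enforce_self_authorization: bool) -> Result<(), String> {
    if policy.release_enabled && !enforce_self_authorization {
        return Err("SNP release requires enforce_self_authorization = true".to_string());
    }
    Ok(())
}

/// Per-request gate: every condition must pass, otherwise deny.
fn check_release(policy: &SnpReleasePolicy, evidence: &VerifiedEvidence) -> Result<(), String> {
    if !policy.release_enabled {
        return Err("SNP release is disabled".to_string());
    }
    if !policy.allowed_tcb_statuses.contains(&evidence.tcb_status) {
        return Err(format!("TCB status {:?} is not allowed", evidence.tcb_status));
    }
    // Any advisory that is not explicitly allowlisted denies the request.
    if let Some(id) = evidence
        .advisory_ids
        .iter()
        .find(|id| !policy.allowed_advisory_ids.contains(*id))
    {
        return Err(format!("advisory {id} is not allowlisted"));
    }
    Ok(())
}
```

Under these assumptions, a lab allowlist would correspond to adding `TcbStatus::OutOfDate` to `allowed_tcb_statuses` explicitly, while `SnpReleasePolicy::default()` keeps denying it.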