This issue captures the proposed support model for AMD SEV-SNP in dstack, starting with the bare-metal path staged in #703 and extending to cloud backends. It is meant to separate the design and production-readiness discussion from the PR implementation review.
Status
#703 adds an experimental, explicitly opt-in SEV-SNP platform mode for dstack-managed VMs.
The current implementation is intentionally conservative:
platform = "auto" does not select SNP. Operators must set platform = "amd-sev-snp".
SNP key and certificate release is disabled by default in KMS.
Enabling SNP release requires both external auth approval and the local KMS release gate.
The SNP code path is separate from the TDX path and does not change TDX attestation behavior.
The current design is close to usable for bare-metal review and hardware testing. It is not the final design for GCP, Azure, AWS, or production SNP with confidential GPUs.
Goals
Add a bare-metal AMD SEV-SNP attestation path for VMs launched by dstack.
Bind SNP key release to hardware-signed evidence and to the app identity dstack intended to launch.
Keep sensitive release paths fail-closed unless SNP support is explicitly enabled.
Preserve the existing TDX trust model and behavior.
Define how SNP should extend to the major cloud providers without forcing one backend to cover incompatible launch models.
Non-goals
Do not make SNP part of automatic platform selection while the ACPI and collateral issues remain open.
Do not use the bare-metal launch-measurement recomputation path for cloud VMs where dstack does not control the launch.
Do not claim SNP runtime composability equivalent to TDX RTMR3 until a vTPM/PCR-backed design exists.
Do not add AWS SEV-SNP support by treating VLEK evidence as VCEK evidence.
Design summary
Bare-metal SNP uses direct AMD report verification plus deterministic launch-measurement recomputation.
The flow is:
The VMM launches a SNP guest with QEMU and includes the dstack app identity in the measured kernel command line.
The guest obtains an AMD SEV-SNP report and sends it through the existing v1 attestation envelope.
KMS verifies the AMD report signature, report policy, and report-data challenge.
KMS recomputes the expected SNP launch measurement from trusted configuration and VMM-provided launch inputs.
KMS compares the recomputed value to the hardware-signed report measurement.
KMS builds a BootInfo and runs the existing auth policy.
KMS releases key or certificate material only if the local SNP release gate also allows the verified BootInfo.
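The flow above can be read as a chain of checks in which any failure denies release. The following is a minimal sketch of that ordering, not the code in #703. The function name, its parameters, and the reduced BootInfo shape are hypothetical; report verification and measurement recomputation are assumed to have happened elsewhere and are passed in as results.

```python
import hmac
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BootInfo:
    # Reduced, illustrative subset of the BootInfo fields named in this issue.
    mr_aggregated: bytes
    os_image_hash: bytes
    device_id: bytes


def snp_release_decision(
    report_ok: bool,
    reported_measurement: bytes,
    recomputed_measurement: bytes,
    boot_info: BootInfo,
    auth_allows: Callable[[BootInfo], bool],
    snp_release_enabled: bool,
) -> bool:
    """Return True only if every step passes; any failure denies release."""
    # Signature, policy, and report_data checks are summarized by report_ok.
    if not report_ok:
        return False
    # The recomputed launch measurement must equal the hardware-signed one.
    if not hmac.compare_digest(reported_measurement, recomputed_measurement):
        return False
    # The existing auth policy runs on the verified BootInfo.
    if not auth_allows(boot_info):
        return False
    # The local SNP release gate has the final say.
    return snp_release_enabled
```

The gate is evaluated last in this sketch so that a disabled gate can never be bypassed by an earlier success.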
This is deliberately different from TDX. TDX exposes RTMR registers and a runtime event log; dstack extracts app identity from the signed RTMR state and replays the event log. SNP exposes one launch MEASUREMENT and does not provide an RTMR3-equivalent runtime extension register. Because of that, dstack has to put the security-critical app identity into the measured launch inputs and recompute the launch measurement later.
TDX and SNP attestation primitives
| | TDX | Bare-metal SNP |
| --- | --- | --- |
| App identity | app-id, compose hash, and KMS binding in RTMR3 events | app_id, compose hash, and rootfs hash in measured kernel command line |
| Verification strategy | Extract signed registers and replay runtime events | Recompute launch measurement and compare byte-for-byte |
| KMS binding | KMS key-provider event measured into RTMR3 | Not equivalent yet; see KMS binding gap below |
| Composability | Supported | Not supported in this mode |
| Report signing key | Intel DCAP chain | AMD VCEK only; VLEK is rejected |
The recompute strategy is valid only when dstack controls the launch. It requires the KMS, VMM, QEMU, OVMF, kernel, initrd, vCPU topology, CPU model, guest feature bits, and kernel command line to stay in exact agreement. Drift should reject attestation rather than accept the wrong guest, so the failure mode is denial of release, not false authorization.
Bare-metal implementation
VMM launch
The SNP VMM path is selected only for explicit platform = "amd-sev-snp" launches.
The VMM:
uses QEMU's sev-snp-guest object with kernel hashes enabled;
uses the EPYC-v4 CPU model for measurement compatibility;
appends docker_compose_hash, rootfs_hash, and app_id to the measured kernel command line;
uses SNP-compatible virtio PCI options;
writes sev_snp_measurement into .sys-config.json so KMS can recompute the same launch measurement.
The measured fields are the dstack app identity for the SNP path. If any field differs from what QEMU measured at launch, KMS recomputation fails and no key material is released.
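To illustrate why the identity fields must match exactly, here is a small sketch of how a measured command line could be assembled. The `name=value` spelling, the lowercase-hex restriction, and the function name are assumptions made for illustration; the actual encoding used in #703 may differ.

```python
def build_measured_cmdline(base_cmdline: str, app_id: str,
                           docker_compose_hash: str, rootfs_hash: str) -> str:
    """Append the dstack identity fields to the base kernel command line."""
    fields = [
        ("docker_compose_hash", docker_compose_hash),
        ("rootfs_hash", rootfs_hash),
        ("app_id", app_id),
    ]
    for name, value in fields:
        # Assumed format: non-empty lowercase hex. Rejecting anything else keeps
        # a value from smuggling extra parameters into the command line.
        if not value or any(c not in "0123456789abcdef" for c in value):
            raise ValueError(f"{name} must be non-empty lowercase hex")
    return " ".join([base_cmdline.strip()] + [f"{n}={v}" for n, v in fields])
```

Because the whole string is part of the measured launch inputs, the VMM and KMS have to produce it byte-for-byte identically, including field order and spacing.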
Guest evidence
The guest obtains a SNP report through the kernel TSM configfs interface when available, or through the SEV-SNP guest device path. The report is carried in the existing v1 attestation format as DstackAmdSevSnp.
The verifier checks:
report length and parseability;
AMD certificate chain and report signature;
exact report_data challenge binding;
VMPL0;
no debug policy;
no migration-agent policy;
unmasked chip key;
VCEK signing key;
policy/platform consistency for SMT, RAPL, and ciphertext-hiding bits.
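A few of the field-level checks in this list can be sketched as below. This assumes the report has already been parsed and its signature chain verified; `ParsedReport` is a hypothetical structure, and the policy bit positions and signing-key encoding reflect my reading of AMD's SEV-SNP firmware ABI specification, so they should be confirmed against the spec. The SMT, RAPL, and ciphertext-hiding consistency checks are omitted.

```python
import hmac
from dataclasses import dataclass
from typing import List

# Assumed guest policy bit positions (check against the AMD ABI specification).
POLICY_MIGRATE_MA = 1 << 18
POLICY_DEBUG = 1 << 19

# Assumed encoding of the report's signing-key field.
SIGNING_KEY_VCEK = 0
SIGNING_KEY_VLEK = 1


@dataclass(frozen=True)
class ParsedReport:
    """Hypothetical, already signature-verified view of an SNP report."""
    vmpl: int
    policy: int
    mask_chip_key: bool
    signing_key: int
    report_data: bytes


def field_check_failures(report: ParsedReport, expected_report_data: bytes) -> List[str]:
    """Return every failed check; an empty list means the fields are acceptable."""
    failures = []
    if not hmac.compare_digest(report.report_data, expected_report_data):
        failures.append("report_data does not match the challenge")
    if report.vmpl != 0:
        failures.append("report was not requested from VMPL0")
    if report.policy & POLICY_DEBUG:
        failures.append("guest policy allows debug")
    if report.policy & POLICY_MIGRATE_MA:
        failures.append("guest policy allows a migration agent")
    if report.mask_chip_key:
        failures.append("chip key is masked")
    if report.signing_key != SIGNING_KEY_VCEK:
        failures.append("report is not signed with VCEK")
    return failures
```

Collecting all failures rather than returning on the first one is a choice made here for easier diagnosis; release is denied whenever the list is non-empty.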
KMS measurement binding
After report verification, KMS parses sev_snp_measurement from vm_config and recomputes the expected launch measurement. The recomputation input includes:
app_id;
compose hash;
rootfs hash;
base kernel command line;
kernel hash;
initrd hash;
OVMF hash or OVMF metadata;
vCPU count;
CPU model;
SNP guest feature bits.
KMS accepts the SNP identity only when the recomputed measurement exactly matches the hardware-signed report measurement.
The SNP BootInfo then uses:
mr_aggregated: the hardware-signed SNP launch measurement;
os_image_hash: the rootfs hash;
device_id: the 64-byte SNP chip_id;
tcb_status: UpToDate when the report's current, reported, committed, and launch TCB values all match; OutOfDate otherwise;
advisory_ids: currently empty, because the report and VCEK evidence used here do not carry an advisory list.
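The tcb_status rule above is small enough to state directly. This sketch assumes each TCB value has been read from the report as a single integer; the function name is illustrative.

```python
def snp_tcb_status(current: int, reported: int, committed: int, launch: int) -> str:
    """UpToDate only when all four TCB values in the report are identical."""
    if current == reported == committed == launch:
        return "UpToDate"
    return "OutOfDate"
```

Note that this compares the values only with each other. It does not compare them to an externally published minimum TCB, which matches the rule as described in this issue.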
Key release gate
KMS applies the existing auth flow and then a local SNP release gate. The default KMS config keeps SNP release disabled.
The gate protects:
GetAppKey;
GetKmsKey;
GetTempCaCert;
SignCert.
KMS startup rejects a configuration that enables SNP release while enforce_self_authorization = false. This prevents a production KMS from enabling SNP release while bypassing its own self-attestation path.
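The gate and the startup check can be sketched as follows. Only enforce_self_authorization and the RPC names come from this issue; `snp_release_enabled`, `KmsConfig`, and the two functions are hypothetical names used for illustration.

```python
from dataclasses import dataclass

# RPCs covered by the SNP release gate, as listed in this issue.
GATED_RPCS = frozenset({"GetAppKey", "GetKmsKey", "GetTempCaCert", "SignCert"})


@dataclass(frozen=True)
class KmsConfig:
    snp_release_enabled: bool = False        # hypothetical key; disabled by default
    enforce_self_authorization: bool = True


def validate_startup(cfg: KmsConfig) -> None:
    """Refuse to start with SNP release on and self-authorization off."""
    if cfg.snp_release_enabled and not cfg.enforce_self_authorization:
        raise ValueError("SNP release requires enforce_self_authorization = true")


def gate_allows(cfg: KmsConfig, rpc: str, is_snp: bool, auth_allowed: bool) -> bool:
    """External auth approval is necessary; SNP callers also need the local gate."""
    if not auth_allowed:
        return False
    if is_snp and rpc in GATED_RPCS:
        return cfg.snp_release_enabled
    return True
```

With the defaults, an SNP caller is denied even after external auth approves, while a non-SNP caller is unaffected by the gate.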
Security properties
The current design provides these properties for bare-metal SNP:
The report must be signed by AMD SNP hardware using a supported VCEK chain.
The report must bind the KMS challenge through report_data.
The verifier must reject reports that indicate debug, migration-agent, masked-chip-key, or unsupported signing-key modes.
App identity is launch-measured through the kernel command line.
KMS recomputes the launch measurement independently and rejects mismatches.
External auth policy must allow the resulting BootInfo.
Local KMS config must explicitly enable SNP key release.
TDX behavior remains isolated from the SNP path.
Production blockers
AMD root pinning
The normal cert-chain path pins the Genoa AMD Root Key (ARK). The empty-cert-chain KDS collateral fetch path currently verifies an internally consistent fetched chain but does not compare the fetched ARK to dstack's pinned ARK.
Before production use:
pin the AMD ARK for every accepted product family;
verify KDS-fetched collateral against the pinned product ARK;
reject products whose root is not pinned.
The current code loops over Genoa, Milan, Bergamo, Siena, and Turin, but only the Genoa root is pinned. Either support those products with pinned roots or narrow the accepted product set.
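One way to express per-product pinning is a lookup that fails closed for any product without a pin. This is a sketch under assumptions: it pins a SHA-256 digest of the whole DER-encoded ARK certificate (pinning the public key instead would be equally reasonable), and the bytes below are placeholders, not real AMD roots.

```python
import hashlib
import hmac
from typing import Mapping


def ark_is_pinned(product: str, ark_der: bytes, pinned_sha256: Mapping[str, str]) -> bool:
    """Accept an ARK only if its product has a pin and the digest matches."""
    expected = pinned_sha256.get(product)
    if expected is None:
        # No pin for this product family: reject rather than trust the fetched chain.
        return False
    actual = hashlib.sha256(ark_der).hexdigest()
    return hmac.compare_digest(actual, expected.lower())


# Placeholder data for illustration only.
fake_genoa_ark = b"placeholder bytes standing in for the Genoa ARK in DER form"
pins = {"Genoa": hashlib.sha256(fake_genoa_ark).hexdigest()}
```

Applying the same function to collateral fetched from KDS would close the gap described above, because an internally consistent chain with an unpinned root would no longer be accepted.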
KDS fetching
The current KDS collateral fetch path uses blocking HTTP in an async verification flow and creates a fresh client for each request.
Before production use:
use an async HTTP client;
set explicit request timeouts;
cache AMD collateral by product, chip ID, and TCB values;
keep collateral validation fail-closed.
This is an availability and operations issue, not a reason to weaken verification.
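A minimal sketch of the caching and timeout behavior, using only asyncio. The `fetch` callable stands in for an async HTTP request to KDS and is hypothetical, as are the class name and the default timeout. The sketch assumes the fetcher returns collateral that is validated before use; a failed or timed-out fetch caches nothing.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

CacheKey = Tuple[str, bytes, int]  # (product, chip ID, TCB value)
Fetcher = Callable[[str, bytes, int], Awaitable[bytes]]


class CollateralCache:
    def __init__(self, fetch: Fetcher, timeout_s: float = 5.0) -> None:
        self._fetch = fetch
        self._timeout_s = timeout_s
        self._entries: Dict[CacheKey, bytes] = {}

    async def get(self, product: str, chip_id: bytes, tcb: int) -> bytes:
        key = (product, chip_id, tcb)
        if key in self._entries:
            return self._entries[key]
        # wait_for raises asyncio.TimeoutError when the deadline passes, so the
        # caller sees a failure and verification stays fail-closed.
        der = await asyncio.wait_for(self._fetch(product, chip_id, tcb), self._timeout_s)
        self._entries[key] = der
        return der
```

This version does not deduplicate concurrent requests for the same key or expire entries; both would be reasonable additions.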
KMS binding gap
TDX records the KMS key-provider identity in RTMR3. Bare-metal SNP currently derives key_provider_info from the app's launch identity and chip ID. That value is deterministic, but it does not independently prove which KMS public key the guest was configured to trust.
The low-complexity fix is to include a hash of the KMS root public key in the SNP measured kernel command line. KMS should recompute the launch measurement with that value, exactly as it does for app_id, compose hash, and rootfs hash.
Until that lands, SNP key_provider_info should not be treated as semantically equivalent to the TDX key-provider event.
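The proposed fix could look roughly like this. The parameter name `kms_root_pubkey_hash`, the choice of SHA-256, and both function names are assumptions for illustration; the issue only specifies that a hash of the KMS root public key goes into the measured command line.

```python
import hashlib


def kms_binding_param(kms_root_pubkey: bytes) -> str:
    """Build the command-line parameter that binds the guest to one KMS root key."""
    if not kms_root_pubkey:
        raise ValueError("KMS root public key must not be empty")
    return "kms_root_pubkey_hash=" + hashlib.sha256(kms_root_pubkey).hexdigest()


def cmdline_binds_kms(measured_cmdline: str, kms_root_pubkey: bytes) -> bool:
    """Check that the measured command line carries the expected KMS binding."""
    return kms_binding_param(kms_root_pubkey) in measured_cmdline.split()
```

Since the command line is part of the launch measurement, a guest configured with a different KMS key would yield a different measurement and fail recomputation.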
ACPI and BadAML
The main security gap is host-supplied ACPI. SNP measured direct boot covers the firmware, kernel, initrd, command line, and vCPU launch state, but it does not cover the ACPI tables that QEMU supplies to the guest. ACPI tables can carry AML bytecode executed by the guest kernel early in boot.
This is a published vulnerability class:
AMD-SB-3012 describes malicious ACPI AML in QEMU SEV-SNP measured direct boot and recommends measuring or sanitizing ACPI tables, including vTPM-based approaches.
GHSA-g9ww-x58f-9g6m documents BadAML against Edgeless Contrast, with affected SNP platforms and a patched version in Contrast 1.18.0.
dstack's bare-metal SNP path is in the same measured-direct-boot category unless the guest image adds an ACPI defense.
Before SNP is selected automatically or described as production-ready, dstack should ship one of:
a guest-kernel AML sandbox that prevents AML from reading or writing private guest memory; or
a vTPM/SVSM measured-boot design that measures ACPI into PCRs and verifies those PCRs during attestation.
The AML sandbox is the smaller direct fix for BadAML. A vTPM/SVSM design is broader: it also restores runtime PCRs that look more like TDX RTMRs, but it changes the bare-metal architecture substantially.
Platform strategy
The right SNP backend is determined by who controls VM launch.
If dstack launches the VM, dstack can recompute the launch measurement. If a cloud provider launches the VM, dstack cannot reliably reproduce the provider's firmware, vCPU, and measured-boot inputs. In that case dstack should use the provider's vTPM and endorsed reference values.
Bare metal
Keep the #703 model for bare metal: direct SNP report verification plus launch-measurement recomputation.
Before promotion beyond experimental:
Fix AMD root pinning and KDS behavior.
Add the KMS public-key hash to the measured command line.
Ship an ACPI defense in the guest image.
Build and publish a coherent meta-dstack SNP guest image that contains the matching guest-side attestation code.
Keep runtime events informational unless a vTPM/PCR design is added.
SVSM remains an option if dstack later needs true runtime PCR extension on bare metal. It is not required to make fixed-identity bare-metal SNP useful.
GCP
GCP SNP should be a separate backend. Google documents both Google-managed vTPM attestation and direct AMD Secure Processor reports for Confidential VM instances. Google also caches VCEK certificates on the VM.
The likely design is similar to dstack's existing GCP TDX shape:
verify the provider vTPM quote and relevant PCRs;
verify or consume the AMD SNP hardware report;
bind runtime events through a provider-backed TPM PCR;
avoid bare-metal launch recomputation.
The open design question is exactly how GCP binds the SNP hardware report to the vTPM quote for the intended dstack trust policy.
Azure
Azure SNP should be its own backend. Azure confidential VMs provide a per-VM vTPM isolated by AMD SEV-SNP. Azure's documentation also describes using the vTPM to retrieve the AMD SNP report and bind vTPM PCR measurements to the report through the vTPM attestation key.
The Azure path should verify Azure's vTPM and platform claims rather than reuse the bare-metal recomputation model.
AWS
AWS is not just another VCEK cloud backend. AWS documents SEV-SNP reports signed with VLEK, while #703 currently accepts only VCEK and rejects VLEK evidence.
AWS support requires a report-layer verifier change before any NitroTPM or platform policy work:
accept the AMD VLEK signing-key mode only for an AWS-specific backend;
fetch and validate AMD's VLEK certificate chain;
keep AWS policy separate from VCEK-based bare-metal, GCP, and Azure paths.
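The separation between VCEK and VLEK backends can be made explicit with a per-backend allowlist. The backend identifiers and the table itself are hypothetical; in particular, treating AWS as VLEK-only is an assumption based on the AWS documentation mentioned above.

```python
from enum import Enum
from typing import Dict, FrozenSet


class SigningKey(Enum):
    VCEK = "vcek"
    VLEK = "vlek"


# Hypothetical backend names; each backend lists the signing-key modes it accepts.
ALLOWED_SIGNING_KEYS: Dict[str, FrozenSet[SigningKey]] = {
    "bare-metal": frozenset({SigningKey.VCEK}),
    "gcp": frozenset({SigningKey.VCEK}),
    "azure": frozenset({SigningKey.VCEK}),
    "aws": frozenset({SigningKey.VLEK}),
}


def signing_key_accepted(backend: str, key: SigningKey) -> bool:
    """Unknown backends accept nothing, so new backends must opt in explicitly."""
    return key in ALLOWED_SIGNING_KEYS.get(backend, frozenset())
```

Keeping the table keyed by backend means VLEK evidence can never be accepted on the bare-metal path by accident.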
Confidential GPU implications
NVIDIA H100 confidential computing can work with VM-based CPU TEEs, including AMD SEV-SNP and Intel TDX. The CPU TEE still has to be attested separately from the GPU, and GPU attestation remains NVIDIA-specific.
| Environment | SNP confidential GPU path |
| --- | --- |
| Bare-metal SNP | Supported design target: pass through H100, verify SNP CPU evidence, then verify NVIDIA GPU evidence. |
| Azure SNP | Azure offers confidential GPU VMs based on AMD SEV-SNP and NVIDIA H100. This belongs in the Azure backend. |
| GCP | Google documents confidential GPU VMs using a3-highgpu-1g and Intel TDX, not SNP. This is a TDX backend concern. |
| AWS | AWS public SEV-SNP documentation currently lists CPU-only m6a, c6a, and r6a families. Treat SNP GPU support as out of scope until AWS exposes a supported confidential GPU platform. |
Open questions
... meta-dstack publishes the SNP guest image, which kernel config should be the minimum supported baseline?