feat: stage AMD SEV-SNP attestation support #703

Open
clawdbot-glitch003 wants to merge 34 commits into Dstack-TEE:master from clawdbot-glitch003:feat/amd-sev-snp-conversion

clawdbot-glitch003 commented Jun 1, 2026

Summary

This PR stages AMD SEV-SNP as a first-class dstack attestation platform alongside existing TDX/Nitro/GCP paths, and includes a controlled/fail-closed SNP key/cert release path.

At a high level, this branch:

  • Adds AMD SEV-SNP evidence plumbing to the v1 attestation format.
  • Collects SNP reports from Linux guest interfaces:
    • configfs TSM first;
    • /dev/sev-guest extended-report ioctl fallback.
  • Verifies SNP reports against AMD ARK/ASK/VCEK collateral, including report-data challenge binding and signed-report policy checks.
  • Adds fail-closed AMD KDS collateral augmentation when local evidence lacks ASK/VCEK, using report chip id + reported TCB.
  • Supports an explicit AMD KDS collateral proxy via DSTACK_AMD_KDS_PROXY_URL / KMS sev_snp.amd_kds_proxy_url for lab hosts that hit AMD KDS throttling.
  • Recomputes SNP launch measurement from OVMF/kernel/initrd/cmdline inputs and compares it to the hardware-verified report measurement.
  • Makes app_id launch-measured for SNP by binding app identity into the measured kernel cmdline.
  • Builds SNP-aware KMS BootInfo from verified evidence: measurement, chip id, app id, compose hash, rootfs hash, TCB status, and advisory ids.
  • Routes SNP KMS/app authorization through the existing auth flow.
  • Adds an explicit local KMS release gate for sensitive SNP outputs.
  • Adds test-scripts/snp-e2e-smoke.sh as a reusable manual hardware smoke script.

Default security posture

SNP release remains fail-closed by default.

Defaults:

[core.sev_snp_key_release]
enabled = false
allowed_tcb_statuses = ["UpToDate"]
allowed_advisory_ids = []

Sensitive release surfaces guarded by this gate:

  • GetAppKey
  • GetKmsKey
  • SignCert
  • self-authorized GetTempCaCert

Additional safety: KMS startup rejects SNP release enablement unless enforce_self_authorization = true, so the self-authorized temp-CA path cannot silently bypass SNP release policy.

Even when local release is enabled, external auth must still allow the verified SNP BootInfo.
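The local gate described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the struct mirrors the three documented `[core.sev_snp_key_release]` keys, while the type name `SnpKeyReleasePolicy`, the function `check_release`, and every error string except `tcb_status is not allowed` are hypothetical. External auth would still run separately.

```rust
/// Hypothetical mirror of the documented `[core.sev_snp_key_release]` keys.
#[derive(Debug, Clone)]
pub struct SnpKeyReleasePolicy {
    pub enabled: bool,
    pub allowed_tcb_statuses: Vec<String>,
    pub allowed_advisory_ids: Vec<String>,
}

impl Default for SnpKeyReleasePolicy {
    /// Defaults as documented: disabled, UpToDate only, no advisories allowed.
    fn default() -> Self {
        Self {
            enabled: false,
            allowed_tcb_statuses: vec!["UpToDate".to_string()],
            allowed_advisory_ids: Vec::new(),
        }
    }
}

/// Returns Ok(()) only if every check passes; any failed check denies release.
pub fn check_release(
    policy: &SnpKeyReleasePolicy,
    tcb_status: &str,
    advisory_ids: &[String],
) -> Result<(), String> {
    if !policy.enabled {
        // Assumed wording; the PR only documents that release is off by default.
        return Err("sev-snp key release is disabled".to_string());
    }
    if !policy.allowed_tcb_statuses.iter().any(|s| s == tcb_status) {
        return Err("tcb_status is not allowed".to_string());
    }
    for id in advisory_ids {
        if !policy.allowed_advisory_ids.contains(id) {
            // Assumed wording for the advisory denial.
            return Err(format!("advisory id {id} is not allowed"));
        }
    }
    Ok(())
}
```

With the defaults, every request is denied because `enabled` is false; a lab policy that sets `enabled = true` and adds `"OutOfDate"` to the status list would accept the lab host's TCB while still denying any advisory id that is not allowlisted.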

AMD KDS collateral proxy support

The lab SNP host hit direct AMD KDS HTTP 429 while fetching VCEK/cert-chain collateral. This PR preserves fail-closed verification and adds an explicit proxy/cache path instead of bypassing cert verification.

Important details:

  • dstack-attest respects DSTACK_AMD_KDS_PROXY_URL for AMD KDS cert-chain and VCEK fetches.
  • KMS config supports core.sev_snp.amd_kds_proxy_url.
  • kms/src/main.rs exports the configured proxy before attestation verification.
  • ra-rpc::QuoteVerifier carries/re-applies the proxy around per-request quote verification.
  • The guest receives dstack.amd_kds_proxy_url=... in the kernel cmdline; basefiles/dstack-prepare.sh exports DSTACK_AMD_KDS_PROXY_URL and writes /run/dstack/environment; basefiles/dstack-guest-agent.service loads that file via EnvironmentFile=-/run/dstack/environment.
  • When the proxy is passed in the launched guest cmdline, VMM/KMS measurement recomputation includes the same cmdline fragment to avoid SNP measurement drift.

The Lit proxy shape used in the smoke is path-prefix passthrough:

https://cors.litgateway.com/https://kdsintf.amd.com/...

not a ?url= wrapper.
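A minimal sketch of the path-prefix shape, assuming the proxy value is trimmed and a single `/` separates it from the full AMD URL. The function name `wrap_with_kds_proxy` is hypothetical and the real dstack-attest helper may differ; the example KDS path in the usage note is also only illustrative.

```rust
/// Prefixes `url` with the proxy base (path-prefix passthrough), or returns
/// `url` unchanged when no proxy is configured or the value is blank.
pub fn wrap_with_kds_proxy(proxy: Option<&str>, url: &str) -> String {
    match proxy.map(str::trim).filter(|p| !p.is_empty()) {
        // Drop trailing slashes so "https://proxy/" and "https://proxy"
        // both produce exactly one separator.
        Some(p) => format!("{}/{}", p.trim_end_matches('/'), url),
        None => url.to_string(),
    }
}
```

For example, `wrap_with_kds_proxy(Some("https://cors.litgateway.com/"), "https://kdsintf.amd.com/vcek/v1/Genoa/cert_chain")` yields `https://cors.litgateway.com/https://kdsintf.amd.com/vcek/v1/Genoa/cert_chain`, with no `?url=` query parameter involved.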

Hardware smoke proof

Manual hardware smoke was rerun on the SNP host:

remote_host=chris@173.234.27.162
host_kernel=Linux 6.11.0-rc3-snp-host-85ef1ac03941
qemu_version=10.0.2
ovmf_path=/opt/AMDSEV/usr/local/share/qemu/OVMF.fd
ovmf_sha256=67e7a7027437823e9c166a60d00666d5d5391e13050488cad5cc2acd913fab4a
image=dstack-dev-0.6.0
platform=amd-sev-snp
image_kernel=Linux 6.18.24-dstack with CONFIG_AMD_MEM_ENCRYPT=y, CONFIG_SEV_GUEST=y, CONFIG_TSM_REPORTS=y

Latest sanitized result:

kms_guest=booted SNP Linux/userspace and started dstack-kms
kms_marker=SNP_KMS_CONTAINER_STARTED / KMS runtime ready
kds_proxy=enabled for smoke via DSTACK_SNP_SMOKE_KDS_PROXY_URL=https://cors.litgateway.com/
strict_tcb_probe=denied_as_expected with tcb_status is not allowed
success_probe=GetTempCaCert HTTP 200; GetAppKey HTTP 200; SignCert HTTP 200; app container started
smoke_result=SNP E2E smoke success
no_secret_material_logged=true

Lab success used:

DSTACK_SNP_SMOKE_ALLOW_OUT_OF_DATE_TCB=1
DSTACK_SNP_SMOKE_KDS_PROXY_URL=https://cors.litgateway.com/

Production defaults still deny OutOfDate TCB and keep allowed_advisory_ids = [].

Image requirement

The working guest image was a coherent meta-dstack image built with:

MACHINE = "sev-snp"

Do not use the default TDX image for SNP smoke. A coherent PR image built with the default tdx machine produced a 6.18.24-dstack kernel with # CONFIG_AMD_MEM_ENCRYPT is not set; in controlled QEMU tests, that kernel reset immediately after OVMF loaded the kernel/initrd. SNP-capable kernels booted through the same QEMU/OVMF path to the Linux/SNP markers.

Also do not rely on ad-hoc dstack-util injection into a stock image. That changed measurement/boot behavior and regressed the boundary. For full app-key success, use a coherent meta-dstack image whose kernel/modules/initramfs/rootfs/verity metadata and guest userspace include the same PR branch.

Quote / attestation proof

Earlier guest quote proof confirmed the SNP guest can produce a hardware report containing the expected challenge bytes:

Memory Encryption Features active: AMD SEV SEV-ES SEV-SNP
SEV: SNP running at VMPL0.
sev-guest sev-guest: Initialized SEV guest driver (using vmpck_id 0)
DSTACK_SEV_SNP_ATTESTATION_PROOF_BEGIN
source=configfs-tsm
report_size=1184
report_data_offset=80
report_contains_expected_report_data=true
DSTACK_SEV_SNP_ATTESTATION_PROOF_END

The final KMS smoke additionally shows that the app guest's SNP evidence verifies through KMS/auth, far enough to exercise both the strict-denial and the lab-success release gates.
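The challenge-binding check behind `report_contains_expected_report_data` can be sketched as below. The offset (80) and report size (1184) come from the proof log above; the 64-byte field length follows the AMD SNP attestation report layout as I understand it. The function and constant names are hypothetical, and a byte comparison like this is only meaningful once the report signature has been verified.

```rust
/// Byte offset of REPORT_DATA in the report, as recorded in the proof log.
pub const REPORT_DATA_OFFSET: usize = 80;
/// REPORT_DATA length (512 bits), assumed from the AMD SNP report layout.
pub const REPORT_DATA_LEN: usize = 64;
/// Total report size, as recorded in the proof log.
pub const REPORT_SIZE: usize = 1184;

/// Returns true only if `report` has the expected size and its REPORT_DATA
/// field equals the expected challenge bytes. Anything else is a mismatch.
pub fn report_contains_expected_report_data(
    report: &[u8],
    expected: &[u8; REPORT_DATA_LEN],
) -> bool {
    if report.len() != REPORT_SIZE {
        return false;
    }
    report[REPORT_DATA_OFFSET..REPORT_DATA_OFFSET + REPORT_DATA_LEN] == expected[..]
}
```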

Measurement proof

A live golden-vector test on an SNP-capable host cross-checks dstack's pure Rust SNP measurement recomputation against sev-snp-measure:

cargo test -p dstack-kms --all-features recomputation_matches_sev_snp_measure_live_golden_vector -- --ignored --nocapture

Latest recorded proof:

DSTACK_SEV_SNP_MEASURE_GOLDEN_VECTOR_BEGIN
utc=2026-06-02T19:49:14Z
host=dedicated-m24-fork
sev_snp_measure=/usr/local/bin/sev-snp-measure
sev_snp_measure_version=sev-snp-measure 0.0.10
ovmf_sha256=67e7a7027437823e9c166a60d00666d5d5391e13050488cad5cc2acd913fab4a
vcpus=2
vcpu_type=EPYC-v4
guest_features=0x1
sev_snp_measurement=6497fb9f90dc4a322228a8a5eb14742e09067bc44c184c2068d583ef628b5bae8c6cf15d91fe1bc0b7a8cbcc575be370
cargo_live_test_result=passed locally on this host at 2026-06-02T19:49:14Z
DSTACK_SEV_SNP_MEASURE_GOLDEN_VECTOR_END

See docs/amd-sev-snp-review-readiness.md for the fuller proof block and review boundary.

Important implementation notes

Key fixes discovered during E2E smoke:

  • VMM .sys-config.json now includes sev_snp_measurement so KMS can recompute the same SNP launch measurement used by QEMU.
  • Released images may carry rootfs_hash only in kernel cmdline (dstack.rootfs_hash=...), so VMM/KMS now preserve and use that path.
  • KMS measurement recomputation preserves the original image cmdline before appending measured docker_compose_hash, rootfs_hash, and app_id.
  • SNP QEMU launch uses EPYC-v4 and confidential virtio PCI options (disable-legacy=on,iommu_platform=true).
  • Configfs TSM reports on the test host may omit ASK/VCEK collateral; when local evidence lacks cert collateral, the verifier now fetches ARK/ASK/VCEK from AMD KDS by report chip_id + reported TCB and fails closed if the fetch or verification fails.
  • SNP guests skip TDX-only app-info / mr_config_id checks while preserving non-SNP behavior.
  • dstack-prepare.sh handles SNP guest detection, early chronyc unavailability, minimal smoke DNS fallback, and AMD KDS proxy propagation.
  • The smoke script supports DSTACK_SNP_SMOKE_KDS_PROXY_URL, configurable VMM ports/URL, port cleanup via fuser, better strict-TCB denial detection, and clearer KDS-blocked vs policy-denied logs.
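The cmdline handling in these notes matters because VMM and KMS must derive byte-identical strings, otherwise the recomputed measurement drifts. A rough sketch under stated assumptions: only `dstack.rootfs_hash=` and `dstack.amd_kds_proxy_url=` appear verbatim in this PR; the `dstack.docker_compose_hash=` and `dstack.app_id=` key spellings, the append order, and the names `MeasuredParams` and `build_measured_cmdline` are hypothetical. A `rootfs_hash` already present in the image cmdline is not modeled here.

```rust
/// Hypothetical bundle of the values bound into the measured kernel cmdline.
pub struct MeasuredParams<'a> {
    pub docker_compose_hash: &'a str,
    pub rootfs_hash: &'a str,
    pub app_id: &'a str,
    /// Present only when the proxy is passed to the guest.
    pub amd_kds_proxy_url: Option<&'a str>,
}

/// Keeps the image's original cmdline first, then appends the measured
/// fragments in one fixed order. Both the launch path and the recomputation
/// path would call the same function so the strings cannot diverge.
pub fn build_measured_cmdline(base_cmdline: &str, p: &MeasuredParams<'_>) -> String {
    let mut parts: Vec<String> = Vec::new();
    let base = base_cmdline.trim();
    if !base.is_empty() {
        parts.push(base.to_string());
    }
    // Key spellings below are assumptions except `dstack.rootfs_hash`
    // and `dstack.amd_kds_proxy_url`.
    parts.push(format!("dstack.docker_compose_hash={}", p.docker_compose_hash));
    parts.push(format!("dstack.rootfs_hash={}", p.rootfs_hash));
    parts.push(format!("dstack.app_id={}", p.app_id));
    if let Some(url) = p.amd_kds_proxy_url {
        parts.push(format!("dstack.amd_kds_proxy_url={url}"));
    }
    parts.join(" ")
}
```

Sharing one builder between launch and recomputation is one way to rule out ordering or whitespace differences between the two paths.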

Validation run

All passed locally on the final branch head:

bash -n test-scripts/snp-e2e-smoke.sh
cargo fmt --all
cargo test -p dstack-kms --all-features
cargo test -p dstack-attest --all-features
cargo test -p dstack-vmm --all-features
cargo test -p ra-rpc --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run

Known limitations / follow-ups

  • platform = "auto" remains conservative while SNP is experimental. Operators must explicitly set platform = "amd-sev-snp".
  • This PR does not claim a production revocation/advisory feed. SNP reports/VCEKs do not directly expose an advisory-list field in the current evidence path, so advisory_ids is currently explicit and empty. Future advisory/revocation collateral should populate it and will be denied unless explicitly allowlisted.
  • AMD KDS fallback/proxy is implemented fail-closed. Production deployments should decide whether they need a trusted cache/proxy and configure it explicitly.
  • The hardware E2E smoke is manual, not CI; the repeatable manual script is checked in at test-scripts/snp-e2e-smoke.sh.
  • Full app success on a fresh box needs a coherent PR-built meta-dstack SNP guest image.
  • The lab host has tcbStatus = "OutOfDate"; success required an explicit lab allowlist. Production defaults still deny this.
  • Chipotle-specific Anvil RPC config is intentionally out of scope for this dstack PR.

Human review focus

Please pay special attention to:

  1. Fail-closed release semantics

    • SNP release disabled by default.
    • UpToDate only by default.
    • advisories denied unless allowlisted.
    • startup rejects release enablement without self-authorization.
  2. Measurement / identity binding

    • app_id, compose hash, rootfs hash, kernel/initrd/cmdline, OVMF, vCPU model, guest features, and optional smoke proxy cmdline are all part of recomputation or policy input.
    • app_id is launch-measured, not just auth metadata.
  3. AMD KDS collateral fallback/proxy

    • Report with no cert chain must not verify unless KDS collateral can be fetched and report signature/policy checks pass.
    • Network/KDS/proxy failure should fail closed.
    • Proxy support should stay explicit and measured when passed to the guest.
  4. Non-SNP regression risk

    • TDX/Nitro/GCP paths should continue through existing behavior.
    • SNP-specific skips should remain scoped to DstackAmdSevSnp.
  5. Operational policy choice

    • Whether to accept any non-UpToDate TCB in production should remain an explicit operator decision, not a default.

@clawdbot-glitch003
Author

SEV-SNP TCB/advisory policy slice is pushed.

What changed:

  • VerifiedAmdSnpReport now carries verifier-derived AMD SNP TCB info from the signed report (current_tcb, reported_tcb, committed_tcb, launch_tcb).
  • KMS SNP BootInfo.tcb_status now comes from that verified report data instead of the old snp-verified-basic-policy placeholder.
    • maps to UpToDate only when current/reported/committed/launch TCB all match;
    • maps to OutOfDate otherwise, which stays denied by default.
  • VerifiedAmdSnpReport.advisory_ids is now explicit and propagated into KMS BootInfo; it is currently empty because the AMD report/VCEK evidence does not carry a direct advisory-list field.
  • The direct fake/default UpToDate SNP boot-info helper is now test-only; production goes through verified attestation.
  • auth-simple docs/tests now describe verifier-derived statuses instead of the placeholder and keep defaults strict: allowedTcbStatuses = ["UpToDate"], allowedAdvisoryIds = [].
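The status mapping described above can be illustrated with a small sketch. The four field names come from this comment; the `TcbVersion` component layout and the function name `tcb_status` are assumptions for illustration, not the PR's actual types.

```rust
/// Hypothetical decoded TCB version (component names assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TcbVersion {
    pub bootloader: u8,
    pub tee: u8,
    pub snp: u8,
    pub microcode: u8,
}

/// TCB values taken from the signed, verified report.
#[derive(Debug, Clone, Copy)]
pub struct SnpTcbInfo {
    pub current_tcb: TcbVersion,
    pub reported_tcb: TcbVersion,
    pub committed_tcb: TcbVersion,
    pub launch_tcb: TcbVersion,
}

/// "UpToDate" only when all four TCB values are identical; anything else
/// maps to "OutOfDate", which the default policy denies.
pub fn tcb_status(info: &SnpTcbInfo) -> &'static str {
    let all_match = info.current_tcb == info.reported_tcb
        && info.reported_tcb == info.committed_tcb
        && info.committed_tcb == info.launch_tcb;
    if all_match {
        "UpToDate"
    } else {
        "OutOfDate"
    }
}
```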

Still fail-closed:

  • SNP key/cert release remains blocked for app keys, KMS keys, signing certs, and temp CA material.
  • Any non-UpToDate status or any advisory ID remains denied unless explicitly allowlisted.

Validation:

  • cargo fmt --all
  • cargo test -p dstack-kms --all-features
  • cargo test -p dstack-attest --all-features
  • cargo check --workspace --all-features
  • git diff --check
  • cd kms/auth-simple && npx oxlint . && npx vitest run
  • independent review: no blockers

@clawdbot-glitch003
Author

Continued with the next quality-gate slice and pushed a small clippy cleanup commit.

Commit:

  • a0ff6efa chore: satisfy sev-snp workspace clippy

What changed:

  • removed a needless return in dstack attestation-mode detection without changing TDX/SNP selection semantics;
  • simplified KMS onboarding response error propagation (Ok(...?) -> direct Result return), preserving behavior;
  • derived Default for TeePlatform with Auto as the default variant, preserving the conservative default.

Validation now passing:

  • cargo fmt --all
  • cargo test -p dstack-kms --all-features
  • cargo test -p dstack-attest --all-features
  • cargo test -p dstack-vmm --all-features
  • cargo check --workspace --all-features
  • cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
  • git diff --check
  • prior auth-simple validation remains: cd kms/auth-simple && npx oxlint . && npx vitest run

Independent review of the cleanup diff found no behavior/security regressions.

@clawdbot-glitch003
Author

Milestone 1 is done: PR #703 is now review-ready staging for AMD SEV-SNP, still without production key release.

New commit:

  • 93354eb6 docs: add sev-snp review readiness note

What changed:

  • Added docs/amd-sev-snp-review-readiness.md documenting:
    • exact review boundary;
    • fail-closed SNP key/cert release posture;
    • strict TCB/advisory defaults;
    • live sev-snp-measure golden-vector proof;
    • prior SNP guest attestation proof summary;
    • local validation commands.
  • Refreshed live golden-vector proof on dedicated-m24-fork at 2026-06-02T19:49:14Z:
    • ignored live test passed: cargo test -p dstack-kms --all-features recomputation_matches_sev_snp_measure_live_golden_vector -- --ignored --nocapture
    • measurement remains 6497fb9f90dc4a322228a8a5eb14742e09067bc44c184c2068d583ef628b5bae8c6cf15d91fe1bc0b7a8cbcc575be370

Validation passed after doc/proof refresh:

  • cargo fmt --all
  • cargo test -p dstack-kms --all-features
  • cargo test -p dstack-attest --all-features
  • cargo test -p dstack-vmm --all-features
  • cargo check --workspace --all-features
  • cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
  • git diff --check
  • cd kms/auth-simple && npx oxlint . && npx vitest run
  • independent review of the review-ready doc/code posture: no blockers

I am marking the PR ready for review now. Milestone 2 remains separate: production SNP key release policy + revocation/advisory collateral + guarded release enablement.

clawdbot-glitch003 marked this pull request as ready for review June 2, 2026 19:57
@clawdbot-glitch003
Author

Milestone 2 is now implemented and pushed.

Commit: 6cb351f9 feat: enable guarded sev-snp key release

What changed:

  • Added local KMS [core.sev_snp_key_release] gate for AMD SEV-SNP key/cert material.
  • Default remains fail-closed: enabled = false, allowed_tcb_statuses = ["UpToDate"], allowed_advisory_ids = [].
  • Release requires both:
    1. verified SNP attestation + recomputed launch measurement + external auth API allow, and
    2. explicit local KMS release opt-in with acceptable TCB/advisory state.
  • Guarded all sensitive SNP release surfaces:
    • GetAppKey
    • GetKmsKey
    • SignCert
    • self-authorized GetTempCaCert
  • Added startup safety: KMS rejects sev_snp_key_release.enabled = true unless enforce_self_authorization = true, so temp-CA self-release cannot bypass SNP release checks in production config.
  • Updated kms/kms.toml and docs/amd-sev-snp-review-readiness.md with the opt-in release policy.
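The startup safety rule can be sketched as a plain config check. The two flags correspond to `sev_snp_key_release.enabled` and `enforce_self_authorization` from this comment; the struct name, field names, function name, and error text are hypothetical.

```rust
/// Hypothetical subset of the KMS config relevant to the startup check.
pub struct KmsStartupConfig {
    pub enforce_self_authorization: bool,
    pub sev_snp_key_release_enabled: bool,
}

/// Refuses to start when SNP release is enabled without self-authorization
/// enforcement, so the self-authorized temp-CA path cannot skip the gate.
pub fn validate_startup(cfg: &KmsStartupConfig) -> Result<(), String> {
    if cfg.sev_snp_key_release_enabled && !cfg.enforce_self_authorization {
        // Assumed wording; the real error message is not shown in this PR.
        return Err(
            "sev_snp_key_release.enabled requires enforce_self_authorization = true".to_string(),
        );
    }
    Ok(())
}
```

Rejecting the combination at startup, rather than per request, means a misconfigured KMS never serves any release surface at all.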

Validation passed:

cargo fmt --all
cargo test -p dstack-kms --all-features
cargo test -p dstack-attest --all-features
cargo test -p dstack-vmm --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run

Independent security review: no release-gate blockers found after the self-authorization startup-safety fix.

@clawdbot-glitch003
Author

SNP E2E smoke follow-up

I kept going on the manual SNP smoke on chris@173.234.27.162 and pushed the fixes/docs in fe08b86f fix: bind sev-snp vm launch inputs.

What the smoke found/fixed:

  • VMM .sys-config.json now includes sev_snp_measurement so KMS SNP BootInfo recomputation has the same launch inputs QEMU used.
  • VMM now accepts released image metadata where rootfs_hash is only present as dstack.rootfs_hash=... in the kernel cmdline.
  • SNP QEMU launch now uses EPYC-v4 and confidential virtio PCI options (disable-legacy=on,iommu_platform=true) for SNP-launched virtio devices.

Smoke status:

  • Tested dstack-0.5.11 and dstack-dev-0.5.11 with PR-built dstack-vmm/supervisor/dstack-kms, QEMU 10.0.2, and SNP OVMF.
  • Both SNP runs reached OVMF loading the measured kernel/cmdline/initrd path and emitted:
    • EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path
  • Neither completed Linux/userspace boot before timeout, so the full dstack-managed guest -> KMS GetAppKey hardware E2E is still blocked before KMS userspace/app-key exercise.
  • Control check: the same dstack-dev-0.5.11 kernel/initrd/rootfs boots without SNP and reaches dstack Guest Preparation Service, narrowing the blocker to SNP+OVMF direct-kernel boot compatibility rather than KMS release policy.
  • No key/secret material was returned.

Validation passed after the fixes:

cargo fmt --all
cargo test -p dstack-vmm --all-features
cargo test -p dstack-kms --all-features
cargo test -p dstack-attest --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run

@clawdbot-glitch003
Author

AMD SEV-SNP manual E2E smoke update

I pushed a follow-up commit that completes the dstack-managed SNP smoke path:

  • Commit: 0a08253a fix: complete sev-snp key release smoke path
  • Smoke host: chris@173.234.27.162
  • QEMU: 10.0.2
  • OVMF: /opt/AMDSEV/usr/local/share/qemu/OVMF.fd (67e7a7027437823e9c166a60d00666d5d5391e13050488cad5cc2acd913fab4a)
  • Image: dstack-dev-0.5.11-snp-dnsfix

What the smoke proved

  • KMS SNP guest booted Linux/userspace and started dstack-kms.
  • App SNP guest booted Linux/userspace and requested app keys from KMS.
  • KMS self auth and app auth both succeeded through auth-simple:
    • /bootAuth/kms -> 200
    • /bootAuth/app -> 200
  • App guest reached GetTempCaCert and GetAppKey against the SNP-backed KMS.
  • KMS metrics after app request:
    • dstack_kms_attestation_requests_total 1
    • dstack_kms_attestation_failures_total 0

Failure gate also exercised

The lab host reports verifier-derived tcbStatus = "OutOfDate". With the default strict release policy (allowed_tcb_statuses = ["UpToDate"]), the app guest was denied as expected:

error: "tcb_status is not allowed"

Then, with an explicit lab-only allowlist (["UpToDate", "OutOfDate"]), the same flow succeeded. Production defaults remain fail-closed.

Fixes included

  • Preserve the released image's original kernel cmdline in SNP measurement recomputation, then append measured docker_compose_hash, rootfs_hash, and app_id exactly like the VMM launch path.
  • Include base_cmdline in VMM-provided sev_snp_measurement input.
  • Add AMD KDS fallback for SNP reports that do not carry cert collateral: fetch ARK/ASK/VCEK from KDS using report chip_id + reported TCB and verify fail-closed.
  • Add configfs TSM -> extended-report ioctl fallback for cert-chain collection.
  • Let SNP guests skip TDX-only app-info / mr_config_id checks while preserving non-SNP behavior.
  • Make dstack-prepare.sh robust for SNP smoke boots (sev-guest detection, early chronyc tolerance, DNS fallback).

Validation run

All passed locally:

cargo fmt --all
cargo test -p dstack-attest --all-features
cargo test -p dstack-util --all-features
cargo test -p dstack-kms --all-features
cargo test -p dstack-vmm --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check
cd kms/auth-simple && npx oxlint . && npx vitest run

No secret/key material was included in logs or this comment.

@kvinwang
Collaborator

kvinwang commented Jun 4, 2026

Thank you so much for this — it's a huge and impressively thorough piece of work. 🙏

I have some other things on my plate right now, but I'll review this once I'm through them. Thanks again!

@clawdbot-glitch003
Author

Fresh-box SNP smoke update (sanitized):

  • Built and ran a coherent meta-dstack dev image with the PR branch wired into guest userspace and MACHINE = "sev-snp".
  • Confirmed why the earlier coherent image still reset: the default tdx machine build produced a dstack kernel without AMD memory-encryption/SNP support. The SNP machine build boots under QEMU 10.0.2 + SNP OVMF.
  • Latest smoke reached:
    • Linux/userspace boot
    • dstack Guest Preparation Service
    • SNP_KMS_CONTAINER_STARTED
    • KMS /metrics readiness
    • app guest Requesting app keys from KMS
    • GetTempCaCert
    • app GetAppKey request boundary
  • Current remaining blocker is external AMD KDS collateral fetch throttling, not guest boot/KMS startup/release-policy wiring:
    • app GetAppKey failed while fetching AMD SEV-SNP VCEK collateral from kdsintf.amd.com
    • observed HTTP 429 for Genoa VCEK request
  • Updated docs/script to make the fresh-box requirements explicit and to avoid overclaiming before KDS collateral fetch completes.

Validation after doc/script update:

bash -n test-scripts/snp-e2e-smoke.sh
cargo fmt --all
cargo test -p dstack-kms --all-features
cargo test -p dstack-vmm --all-features
cargo test -p dstack-attest --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check

Commit: 24d61e74 docs: clarify sev-snp fresh-box smoke

No secrets or credential material included in this update.

@clawdbot-glitch003
Author

Fresh-box SNP smoke follow-up pushed in 45c77779 (docs: record sev-snp smoke gate boundary).

What changed:

  • Updated test-scripts/snp-e2e-smoke.sh so the strict TCB probe actually drives an app guest to the strict KMS GetAppKey path instead of only waiting for KMS startup.
  • Added separate strict/success KMS host ports so the failure and success probes can run in one smoke without port collision.
  • Fixed the app deploy helper so captured VM IDs stay clean; compose/deploy diagnostics now go to artifacts/stderr instead of contaminating command substitution.
  • Updated docs/amd-sev-snp-review-readiness.md with the latest coherent MACHINE = "sev-snp" image result and the exact remaining boundary.

Remote smoke evidence from chris@173.234.27.162 using QEMU 10.0.2, SNP OVMF 67e7a7027437823e9c166a60d00666d5d5391e13050488cad5cc2acd913fab4a, and coherent dstack-dev-0.6.0 image:

  • KMS guest boots Linux/userspace and reaches SNP_KMS_CONTAINER_STARTED.
  • Strict probe app guest reaches dstack-prepare.sh, detects SEV-SNP, and requests app keys from strict KMS at GetTempCaCert / GetAppKey.
  • Success probe app guest reaches the same GetTempCaCert / GetAppKey request boundary against lab-allowlisted KMS.
  • Both probes are currently blocked before final strict-denial/success markers by external AMD KDS collateral fetch throttling: Genoa/Milan VCEK requests return HTTP 429; other product fallbacks return expected nonmatching-product 404s.

So the remaining gap is not guest boot, VMM wiring, KMS startup, or release-policy plumbing. The current blocker is external AMD KDS collateral availability/rate-limiting for the app quote.
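For context, the throttled requests are VCEK lookups keyed by product name, chip id, and reported TCB. A minimal sketch of how such a URL might be built, assuming the query-parameter names of AMD's public KDS interface (`blSPL`, `teeSPL`, `snpSPL`, `ucodeSPL`); the names `ReportedTcb` and `vcek_url` are hypothetical, number padding is not modeled, and the PR's actual fetch code may differ.

```rust
/// Hypothetical reported-TCB components used as KDS query parameters.
pub struct ReportedTcb {
    pub bootloader: u8,
    pub tee: u8,
    pub snp: u8,
    pub microcode: u8,
}

/// Builds a VCEK URL for one product name (for example "Genoa" or "Milan").
/// `chip_id` is hex-encoded in lowercase; the URL shape is an assumption
/// based on AMD's published KDS interface.
pub fn vcek_url(product: &str, chip_id: &[u8], tcb: &ReportedTcb) -> String {
    let hwid: String = chip_id.iter().map(|b| format!("{b:02x}")).collect();
    format!(
        "https://kdsintf.amd.com/vcek/v1/{product}/{hwid}?blSPL={}&teeSPL={}&snpSPL={}&ucodeSPL={}",
        tcb.bootloader, tcb.tee, tcb.snp, tcb.microcode
    )
}
```

A verifier that does not know the product up front could try each candidate in turn, treating a 404 as "wrong product" and any other failure, including 429, as a hard error so that verification still fails closed.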

Validation passed after this update:

bash -n test-scripts/snp-e2e-smoke.sh
git diff --check
cargo fmt --all
cargo test -p dstack-kms --all-features
cargo test -p dstack-vmm --all-features
cargo test -p dstack-attest --all-features
cargo check --workspace --all-features
cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
git diff --check

No secret material is included in the docs or PR evidence.

@clawdbot-glitch003
Author

Update: completed the fresh-box SNP smoke through both gates and pushed the follow-up fix.

What changed in the latest commit (b9d968de, fix: complete sev-snp smoke proxy path):

  • Added controlled AMD KDS proxy support for SNP collateral fetches (DSTACK_AMD_KDS_PROXY_URL) so lab runs can avoid AMD KDS HTTP 429 throttling without weakening attestation policy.
  • Propagated the proxy through:
    • guest cmdline / dstack-prepare.sh / /run/dstack/environment for guest services,
    • KMS config startup before attestation verification,
    • RA-RPC quote verification before per-request cert/quote validation.
  • Mirrored the proxy cmdline fragment in VMM measured launch and KMS measurement recomputation, so enabling the smoke proxy does not create a measurement mismatch.
  • Hardened test-scripts/snp-e2e-smoke.sh with separate ports/VMs, KMS-log-aware strict probe checks, and reusable proxy configuration.
  • Updated docs/amd-sev-snp-review-readiness.md with the final sanitized smoke boundary.
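The guest-side propagation step (kernel cmdline to environment) is implemented in `dstack-prepare.sh`; the Rust below is only an illustration of the lookup logic, with hypothetical function names. The key `dstack.amd_kds_proxy_url` and the variable `DSTACK_AMD_KDS_PROXY_URL` are the ones named in this PR.

```rust
/// Returns the value of `key=value` from a kernel cmdline, matching the key
/// exactly so that longer keys sharing the same prefix are ignored.
pub fn cmdline_param<'a>(cmdline: &'a str, key: &str) -> Option<&'a str> {
    cmdline
        .split_whitespace()
        .find_map(|tok| tok.strip_prefix(key).and_then(|rest| rest.strip_prefix('=')))
}

/// Produces the environment-file line for the proxy, if one was passed.
/// An empty value is treated as "not configured".
pub fn kds_proxy_env_line(cmdline: &str) -> Option<String> {
    cmdline_param(cmdline, "dstack.amd_kds_proxy_url")
        .filter(|v| !v.is_empty())
        .map(|v| format!("DSTACK_AMD_KDS_PROXY_URL={v}"))
}
```

Given a cmdline containing `dstack.amd_kds_proxy_url=https://cors.litgateway.com/`, `kds_proxy_env_line` returns `DSTACK_AMD_KDS_PROXY_URL=https://cors.litgateway.com/`, which is the kind of line a service could load through `EnvironmentFile=`.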

Remote smoke result on chris@173.234.27.162 with coherent MACHINE = "sev-snp" dstack-dev-0.6.0 image, QEMU 10.0.2, and SNP OVMF sha 67e7a7027437823e9c166a60d00666d5d5391e13050488cad5cc2acd913fab4a:

strict_tcb_probe=denied_as_expected with "tcb_status is not allowed"
success_probe=GetTempCaCert HTTP 200; GetAppKey HTTP 200; SignCert HTTP 200; app container started
smoke_result=SNP E2E smoke success
no_secret_material_logged=true

Validation rerun after the final patch:

  • bash -n test-scripts/snp-e2e-smoke.sh
  • cargo fmt --all
  • cargo test -p dstack-attest --all-features amd_kds_proxy_url_wraps_amd_urls_when_configured -- --nocapture
  • cargo test -p ra-rpc --all-features quote_verifier_carries_trimmed_amd_kds_proxy_url -- --nocapture
  • cargo test -p dstack-vmm --all-features amd_sev_snp -- --nocapture
  • cargo test -p dstack-kms --all-features
  • cargo test -p dstack-attest --all-features
  • cargo test -p dstack-vmm --all-features
  • cargo test -p ra-rpc --all-features
  • cargo check --workspace --all-features
  • cargo clippy --workspace --all-features -- -D warnings --allow unused_variables
  • git diff --check
  • cd kms/auth-simple && npx oxlint . && npx vitest run

No secrets or key material were included in the smoke artifacts or PR note.

Contributor

Published the AMD SEV-SNP support design and production-readiness tracker here: #713

This separates the broader design discussion from the PR implementation review. The issue captures:

  • the intended boundary for this PR (feat: stage AMD SEV-SNP attestation support #703) as explicit opt-in bare-metal SNP support;
  • the production blockers: AMD root pinning, KDS fetching/caching, KMS binding parity, and ACPI/BadAML mitigation;
  • the platform strategy for bare metal, GCP, Azure, and AWS;
  • the decisions and open questions before SNP should become automatic or production-ready.

@clawdbot-glitch003
Author

Applied the Chipotle-agent feedback to the PR docs/body without pulling Chipotle-specific app config into dstack scope.

Updates pushed in 64a33e8a:

  • Clarified the AMD KDS proxy shape and smoke env:
    • DSTACK_SNP_SMOKE_KDS_PROXY_URL=https://cors.litgateway.com/
    • runtime export as DSTACK_AMD_KDS_PROXY_URL
    • path-prefix passthrough: https://cors.litgateway.com/https://kdsintf.amd.com/..., not ?url=
  • Clarified that the final smoke is no longer blocked at AMD KDS 429 when the proxy is enabled.
  • Documented that lab success used DSTACK_SNP_SMOKE_ALLOW_OUT_OF_DATE_TCB=1, while production defaults remain UpToDate only with an empty advisory allowlist.
  • Kept the coherent image requirement explicit: build/use MACHINE = "sev-snp"; default TDX images can miss CONFIG_AMD_MEM_ENCRYPT and reset after OVMF loads kernel/initrd.
  • Updated validation docs to include bash -n test-scripts/snp-e2e-smoke.sh and cargo test -p ra-rpc --all-features.
  • Rewrote the PR body so it reflects the latest successful managed SNP smoke: KMS ready, strict TCB denial, permissive lab GetTempCaCert / GetAppKey / SignCert success, and app container startup.

Validated before push:

bash -n test-scripts/snp-e2e-smoke.sh
cargo fmt --all --check
git diff --check

A Chipotle-specific note was deliberately left out of the dstack implementation/docs, except as an out-of-scope item in the PR body: the app needed ANVIL_CHAIN_RPC=http://10.0.2.2:8545 / Chain::Anvil handling fixed on the Chipotle side.
