docs(spec): Bicep-less Foundry agent init design (#8065) #8577

Open
hund030 wants to merge 6 commits into main from spec/bicepless-foundry
hund030 commented Jun 9, 2026

Collaborator

Design spec for RFC #8065 — make azd ai agent init Bicep-less by default, with the azure.ai.agents extension owning provisioning via a custom provider named microsoft.foundry.

Summary

Moves infrastructure templates from Azure-Samples/azd-ai-starter-basic into the azure.ai.agents extension binary. azd ai agent init produces only azure.yaml and an agent code project — no infra/ directory. At azd provision time, the extension's own provisioning provider synthesizes Bicep in memory from azure.yaml and applies it. azd ai agent init --infra ejects on demand: same synthesis, written to ./infra/; subsequent provisions read from disk.

The azure.yaml shape is fixed by the Foundry azure.yaml reference: a single host: microsoft.foundry service per project, with nested agents:, deployments:, connections:, etc. This spec only changes how that file is provisioned; it does not redesign the YAML.
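
The provision-time choice described above (read `./infra/` when it exists, synthesize in memory otherwise) could be sketched roughly as follows. This is an illustration only, not the extension's code: `resolveBicepSource` and the `bicepSource` type are hypothetical names, and treating "present" as "the directory contains at least one `.bicep` file" is an assumption the summary does not spell out.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// bicepSource says where provisioning would take its Bicep from.
// Hypothetical type, introduced only for this sketch.
type bicepSource string

const (
	sourceOnDisk   bicepSource = "on-disk"
	sourceInMemory bicepSource = "in-memory"
)

// resolveBicepSource returns sourceOnDisk when infraDir holds at least one
// .bicep file (an assumed definition of "present"), and sourceInMemory when
// the directory is missing or contains no Bicep.
func resolveBicepSource(infraDir string) (bicepSource, error) {
	entries, err := os.ReadDir(infraDir)
	if errors.Is(err, fs.ErrNotExist) {
		// No ./infra/ at all: the project has not been ejected.
		return sourceInMemory, nil
	}
	if err != nil {
		return "", fmt.Errorf("reading %s: %w", infraDir, err)
	}
	for _, e := range entries {
		if !e.IsDir() && strings.HasSuffix(e.Name(), ".bicep") {
			return sourceOnDisk, nil
		}
	}
	return sourceInMemory, nil
}

func main() {
	src, err := resolveBicepSource(filepath.Join(".", "infra"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("bicep source:", src)
}
```

Note that the sketch never writes to `azure.yaml`, in line with the rule that eject does not mutate it.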

Key design decisions

  • host: microsoft.foundry — one consolidated service per Foundry project, with nested agents[], deployments[], connections[], toolboxes[], skills[], routines[]. Matches the reference doc.
  • infra.provider: microsoft.foundry — explicit declaration required until service-host-driven auto-routing lands. Reuses the custom provisioning provider framework from feat: Add provisioning provider support to extension framework #7482 (merged).
  • Brownfield via endpoint: URL — not ARM resource ID. Matches what Portal and az CLI show; deploy verb resolves ARM ID from endpoint when it needs control-plane access.
  • Three deploy modes per agent: docker:, runtime:, or image: (pre-built container). Exactly one is required; validator rejects two-or-none.
  • On-disk reuse after eject: extension reads ./infra/ when present, synthesizes otherwise. azure.yaml is never mutated by eject.
  • No --force flag for eject. To regenerate, the user deletes ./infra/ and re-runs the command.
  • Preview via ARM What-If — mirrors Core's Bicep provider (scope.go:132 uses WhatIfDeployToResourceGroup).
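
As a rough illustration of the "exactly one deploy mode" rule in the list above, a validator might look like the sketch below. The function name `validateDeployMode` and the use of a raw `map[string]any` for an agent body are assumptions made for this example; the spec does not prescribe either, and the error wording is invented.

```go
package main

import (
	"fmt"
	"strings"
)

// deployModeKeys are the three mutually exclusive per-agent deploy modes.
var deployModeKeys = []string{"docker", "runtime", "image"}

// validateDeployMode checks that an agent body sets exactly one of
// docker:, runtime:, or image:. The body is modeled as a raw YAML mapping
// (hypothetical representation for this sketch).
func validateDeployMode(name string, body map[string]any) error {
	var present []string
	for _, k := range deployModeKeys {
		if _, ok := body[k]; ok {
			present = append(present, k)
		}
	}
	switch len(present) {
	case 1:
		return nil
	case 0:
		return fmt.Errorf("agent %q: one of docker, runtime, or image is required", name)
	default:
		return fmt.Errorf("agent %q: only one deploy mode is allowed, found %s",
			name, strings.Join(present, ", "))
	}
}

func main() {
	// Agent names and values below are placeholders.
	agents := []struct {
		name string
		body map[string]any
	}{
		{"one-mode", map[string]any{"docker": map[string]any{}}},
		{"no-mode", map[string]any{}},
		{"two-modes", map[string]any{"docker": map[string]any{}, "image": "placeholder"}},
	}
	for _, a := range agents {
		if err := validateDeployMode(a.name, a.body); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("ok:", a.name)
	}
}
```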

Scope

In scope: Bicep-less default behavior, eject command, embedded templates, ARM-backed synthesis only (Foundry project + model deployments + ACR when needed), schema relaxation to allow extension-named providers in infra.provider.

Out of scope:

  • azd deploy — agent code push and data-plane reconciliation (connections, toolboxes, skills, routines, agent definitions) are deploy's job, not provisioning's. This spec ends at "ARM resources are in place."
  • Data-plane YAML fields — connections:, toolboxes:, skills:, routines:, agent-level tools:/skill:, $ref: resolution. Synthesizer reads them only to skip them; deploy verb owns them.
  • Service-host-driven auto-routing (RFC Core Ask 1) — would remove the explicit infra.provider:. Defers to a future spec.
  • Unified azure.yaml schema (Unify Foundry agent configuration in azure.yaml #7962); incremental composition (Add connections, models, tools, and skills to Foundry Agent projects after init #8049); coexistence with non-Foundry services (use infra.layers[] as escape hatch).

Core changes collapsed. The original RFC asked Core to surface services.<svc>.uses and a typed services.<svc>.runtime on the extension-facing proto. With nested agents[], runtime lives inside the service body (read via additional_properties); no proto/struct/mapper plumbing needed. Down to one Core change: relax the infra.provider enum.

Notes for reviewers

Doc-only PR. Adds docs/specs/bicepless-foundry/spec.md. No code changes — the spec is the implementation contract; code PRs follow.

Particular attention welcomed on:

  1. The "What the synthesizer ignores" table — is the data-plane / ARM split drawn at the right line?
  2. The Open Question and its proposed answer (drift detection on Deploy()).
  3. The eject "delete-and-rerun" model with no --force flag — does this match azd's broader UX posture?

Design spec for RFC #8065 — make `azd ai agent init` Bicep-less by
default with the `azure.ai.agents` extension owning provisioning.
Adopts the compromise of explicit `infra.provider: azure.ai.agents`
in azure.yaml (per PR #7482's custom provisioning provider framework),
deferring service-host-driven auto-routing to v0.3+. Covers in-memory
synthesis, on-disk reuse after eject, brownfield `resourceId:` flow,
5-step validation pipeline, and the small Core changes required to
surface `uses` / `runtime` on the extension-facing `ServiceConfig`.
@hund030 hund030 requested a review from therealjohn June 9, 2026 10:26
@hund030 hund030 marked this pull request as ready for review June 9, 2026 10:26
Copilot AI review requested due to automatic review settings June 9, 2026 10:26
github-actions Bot commented Jun 9, 2026


🔗 Linked Issue Required

Thanks for the contribution! Please link a GitHub issue to this PR by adding Fixes #123 to the description or using the sidebar.
No issue yet? Feel free to create one!

Copilot AI left a comment

Contributor

Pull request overview

Adds a new design spec documenting the proposed “Bicep-less by default” experience for azd ai agent init, where the azure.ai.agents extension owns provisioning via a custom provisioning provider and can optionally “eject” generated Bicep to ./infra/.

Changes:

  • Introduces a detailed design spec for extension-owned provisioning + in-memory Bicep synthesis.
  • Defines activation/eject behavior, validation pipeline, and post-eject drift/UX expectations.
  • Outlines required (future) Core schema/proto surfacing for uses and a service-level runtime.

hund030 added 4 commits June 10, 2026 14:17
Review-pass revisions on docs/specs/bicepless-foundry/spec.md:

- Fix Core Changes #1: Uses already exists on ServiceConfig (service_config.go:58) and the v1 schema (azure.yaml.json:234); the gap is proto-only. Runtime remains the larger gap. Rewrote the section narrative and trimmed the file-change table accordingly.

- Fix On-Disk Reuse table: azdFileShareUploadOperations is at infra/provisioning/manager.go:125, not :121. Disambiguated all four rows with full paths and 'call'/'gate' suffix.

- Fix Core Changes mapper range: mapper_registry.go:102-162 (was :139-161, which was a sub-slice).

- Remove all v0.2/v0.3+/0.1.x/0.2.x version markers; the spec doesn't own a release schedule.

- Trim Problem, Solution, Provider Resolution, Explicit Declaration, Brownfield, In-Memory Synthesis, and Embedded Templates sections; remove duplicate eject example; consolidate Post-Eject trade paragraph.

- Open Questions: drop --preview-bicep entry, add one-line proposals to the remaining two.
…ecision)

- Preview: replace 'Same as Deploy with validationOnly mode' with 'ARM What-If, mirrors Core's Bicep provider'. validationOnly hits ARM's /validate endpoint and returns template-validity errors; What-If hits /whatIf and returns a real change diff. Core's bicep provider uses WhatIfDeployToResourceGroup (cli/azd/pkg/infra/scope.go:132); the spec now matches.

- pathHasModule row: clarify that os.ReadDir returns NotExist on missing ./infra/, and the caller's 'err == nil && moduleExists' guard is what falls through. Prior wording 'returns false' was imprecise.
…y host, nested agents)

Adopts the consolidated YAML shape from
therealjohn/foundry-azd-config-preview/REFERENCE.md and trims the spec
accordingly.

Shape changes:
- Host: azure.ai.project + azure.ai.agent (two services, uses: link) ->
  microsoft.foundry (single service with nested agents[]).
- Provider name: azure.ai.agents -> microsoft.foundry (matches host
  kind, reads like an engine next to bicep/terraform).
- Brownfield signal: resourceId (ARM ID) -> endpoint (URL, matches
  Portal/CLI UX).
- Deploy modes: added image: as third option alongside docker:/runtime:.

Scope tightening:
- azd deploy explicitly out of scope (agent code push and data-plane
  reconciliation are deploy's job).
- Data-plane fields (connections, toolboxes, skills, routines,
  agent-level tools/skill, $ref) silently ignored by synthesizer; new
  field-skip table makes this explicit.
- Coexistence with non-Foundry services out of scope; infra.layers[]
  noted as escape hatch.

Core changes collapsed:
- Removed "Surface uses/runtime to extensions" as a Core ask. With
  nested agents[], the runtime is inside the service body (read via
  additional_properties); no proto/struct/mapper plumbing needed.
- Down to two Core changes: relax infra.provider enum, and the deferred
  auto-install (#7502).

Validation pipeline rewritten to match the new invariants. Per-agent
deploy-mode check now allows exactly one of docker/runtime/image.
Brownfield validation checks endpoint URL shape. Foundry server-side
templating syntax pass-through made explicit.
- Remove Core Changes section on auto-install (#7502 — already delivered
  by #7482; not a gap, not in our scope).
- Drop Open Question 2 (schema branch ownership with #7962). It was a
  coordination artifact, not a real dependency.
- Drop #7962 and #8049 References entries.
- Drop the forward reference to `azd ai agent add monitoring` (per #8049)
  from Embedded templates — monitoring is out of scope.
- Drop the auto-install Risks row.

Keeps #7962 and #8049 only as one-line out-of-scope pointers in the
Scope section.
@glharper

Member

Spec review

Strong, well-grounded spec. I verified the code citations against main and they're all accurate: ParseProvider pass-through (provisioning.go:53-57), provider resolution unchanged (manager.go:505-540), detectProviderFromFiles gated on NotSpecified (importer.go:304), ProjectInfrastructure never inspecting service.Host (importer.go:292-358), the azdext.ProvisioningProvider interface methods (:23-36) matching the method table, schema enum at azure.yaml.json:44-52, and registration parity between default.go:81-86 and provisioning_service.go:138-152. Implementation-ready.

The three questions you flagged

1. Is the data-plane/ARM split in "What the synthesizer ignores" drawn correctly?
Mostly, but there's an internal inconsistency. The table lists agents[].runtime: as ignored ("deploy verb's job"), yet Validation step 3 requires the synthesizer to read docker:/runtime:/image: to enforce the exactly-one deploy-mode invariant, and docker:->ACR / image:->no-ACR are explicitly read for branching. So those three aren't "ignored" - they're read for validation and ARM branching but emit no ARM of their own. Suggest splitting the table into read-for-branching-but-no-ARM (docker, image, runtime) vs not-read-at-all (connections, toolboxes, skills, routines, tools, skill, protocols, env, startupCommand). As written the table contradicts the validation section.

2. Drift detection on Deploy() - is "no detection" right?
This is the weakest part of the proposal. The justification ("CLI add commands already warn at the entry point") only covers mutations made through azd commands. A user hand-editing azure.yaml (e.g. adding a second container agent -> needs ACR) bypasses every add warning, then azd provision reads stale on-disk Bicep, silently skips the ACR, and the failure surfaces much later in azd deploy (docker push) with a confusing error. A cheap, non-invasive guard at Deploy() - diff the set of ARM-relevant fields in azure.yaml against what the on-disk template declares, and warn (not block) - would catch this without re-introducing auto-merge. I'd push back on "no detection" and at least make it a one-line warn.
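
To make the suggested guard concrete, here is a minimal sketch of a warn-only drift check for the ACR case mentioned above. It is not taken from the spec: `needsACR` and `driftWarnings` are hypothetical names, agents are modeled as raw YAML mappings, and detecting a registry by searching the Bicep text for the `Microsoft.ContainerRegistry/registries` resource type is a deliberately crude assumption (a real check would more likely parse or compile the template).

```go
package main

import (
	"fmt"
	"strings"
)

// acrResourceType is the ARM resource type for Azure Container Registry.
const acrResourceType = "Microsoft.ContainerRegistry/registries"

// needsACR reports whether any agent uses docker: mode, which per the
// review implies a registry (docker: -> ACR, image: -> no ACR).
func needsACR(agents []map[string]any) bool {
	for _, a := range agents {
		if _, ok := a["docker"]; ok {
			return true
		}
	}
	return false
}

// driftWarnings compares what azure.yaml implies against the on-disk Bicep
// text and returns warnings only; it never blocks the deployment.
func driftWarnings(agents []map[string]any, onDiskBicep string) []string {
	var warnings []string
	declared := strings.Contains(onDiskBicep, acrResourceType)
	if needsACR(agents) && !declared {
		warnings = append(warnings,
			"azure.yaml has a docker: agent but ./infra/ declares no container registry")
	}
	return warnings
}

func main() {
	// A hand-edited azure.yaml that added a docker: agent after eject.
	agents := []map[string]any{
		{"image": "placeholder"},
		{"docker": map[string]any{}},
	}
	staleBicep := "// on-disk template generated before the docker: agent was added\n"
	for _, w := range driftWarnings(agents, staleBicep) {
		fmt.Println("WARNING:", w)
	}
}
```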

3. Eject delete-and-rerun with no --force?
Refusing to clobber user-owned files matches azd's posture. But a hard refuse is slightly more friction than azd's usual style - azd typically prompts on overwrite rather than erroring out. Consider an interactive confirm ("./infra/ exists, regenerate and overwrite? [y/N]") with the hard-refuse behavior preserved under --no-prompt/CI. Keeps the safety while feeling more native. The all-or-nothing scope is right.
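
The proposed behavior (prompt interactively, hard-refuse under --no-prompt) reduces to a small decision function. The sketch below is illustrative: `shouldWriteInfra`, `errInfraExists`, and the `confirm` callback are hypothetical names, and the callback stands in for whatever prompt mechanism azd would actually use.

```go
package main

import (
	"errors"
	"fmt"
)

// errInfraExists is the hard-refuse error for non-interactive runs.
var errInfraExists = errors.New("./infra/ already exists; delete it and re-run to regenerate")

// shouldWriteInfra decides whether eject may write ./infra/.
//   - no existing directory: always write
//   - existing directory with --no-prompt: refuse with an error
//   - existing directory, interactive: defer to the user's answer
func shouldWriteInfra(infraExists, noPrompt bool, confirm func() (bool, error)) (bool, error) {
	if !infraExists {
		return true, nil
	}
	if noPrompt {
		return false, errInfraExists
	}
	return confirm()
}

func main() {
	// Simulates the user accepting the default answer (N) at the prompt.
	decline := func() (bool, error) { return false, nil }

	write, err := shouldWriteInfra(true, false, decline)
	fmt.Println("interactive, declined:", write, err)

	write, err = shouldWriteInfra(true, true, nil)
	fmt.Println("--no-prompt:", write, err)
}
```

Passing the prompt in as a function keeps the decision logic testable without a terminal.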

Other findings

  • Destroy is under-specified and risks a real bug. "Delete resource group or use deployment stacks" omits soft-delete purge. Foundry/Cognitive Services accounts (and any Key Vault) are soft-delete-protected; Core's Bicep provider handles purge. Without it, up->down->up under the same name fails on the second provision. Should be called out so the extension's Destroy matches Core parity.

  • Schema relaxation loses typo-catching. Changing enum: ["bicep","terraform"] -> examples means provider: biceps is no longer flagged in IDE validation for anyone, not just Foundry users. Since extension provider names are open-ended an enum can't enumerate them - but consider a pattern (e.g. ^[a-z0-9.]+$) alongside examples to retain some validation, or note the accepted regression explicitly.

  • Brownfield + Preview (What-If) needs scope resolution too. The spec says ARM-ID-from-endpoint: resolution is deploy-time, but What-If also needs the target scope (RG/subscription). Confirm Preview resolves endpoint->scope, not just Deploy.

  • Telemetry doc requirement. Per repo convention (cli/azd/AGENTS.md), new fields (provision.synthesis_source, init.infra_flag) must be added to docs/reference/telemetry-data.md. Worth naming this file as a required deliverable in the implementation PRs.

  • infra.layers[] escape-hatch is asserted but unverified. The coexistence story relies on per-layer provider scoping, but importer.go:316 short-circuits on len(Layers) > 0 and the spec doesn't confirm an InfraLayer can name an extension provider. Worth a one-line verification since it's the only mixed-project answer.

  • Minor: Stability Contract ("semantically identical") vs Test Plan ("byte-equal output") - clarify byte-equal is per extension version. And the infra.provider: line added by init becomes dead once auto-routing lands; note a cleanup path.
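
For reference, the example pattern from the schema finding above can be exercised directly. This sketch uses Go's `regexp` package rather than a JSON Schema validator; for a pattern this simple the two regex dialects should agree, but that equivalence is an assumption. The sample provider names are illustrative. One thing it makes visible: the pattern constrains the character set, so an all-lowercase misspelling such as `biceps` still matches, and only names with uppercase letters, spaces, or other characters are rejected.

```go
package main

import (
	"fmt"
	"regexp"
)

// providerPattern is the example pattern from the review comment.
var providerPattern = regexp.MustCompile(`^[a-z0-9.]+$`)

func main() {
	names := []string{
		"bicep",
		"terraform",
		"microsoft.foundry",
		"biceps",      // lowercase typo: same character set as a valid name
		"Bicep",       // uppercase letter
		"my provider", // contains a space
		"",            // empty string
	}
	for _, n := range names {
		fmt.Printf("%q matches=%v\n", n, providerPattern.MatchString(n))
	}
}
```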

hund030 commented Jun 11, 2026

Collaborator Author

@glharper Thanks for the careful review — all 8 points hold up and have been applied to the spec:

  1. Synthesizer ignores table — split into "read for branching" (docker/runtime/image) and "not read at all" (everything else). runtime was the only contradiction; docker/image weren't actually in the table.
  2. Drift detection — Open Question 1 flipped to warn (don't block) on every on-disk Deploy(). Pseudocode and method-table row updated.
  3. Eject UX — now matches azd infra generate: interactive [y/N] prompt, --no-prompt keeps the hard-refuse for CI.
  4. Destroy soft-delete purge — row updated to mirror Core's getCognitiveAccountsToPurge + purgeItems (bicep_provider.go:1283-1413); up → down → up failure mode called out.
  5. Schema typo regression — pattern: ^[a-z0-9.]+$ + examples instead of examples alone.
  6. Brownfield + Preview scope — both Deploy and Preview now resolve endpoint: to ARM ID + scope before invoking ARM.
  7. Telemetry doc requirement — one-line callout naming docs/reference/telemetry-data.md per cli/azd/AGENTS.md:246-249.
  8. infra.layers[] escape hatch — verified inline: InfraLayer.Provider (provider.go:57 -> :40) + ParseProvider accepts any string.
  9. Stability vs Test Plan wording — tightened to byte-stable within a patch extension version, matching the test.

…schema, scope)

All 8 substantive points from the maintainer review applied:

- Split "What synthesizer ignores" into two tables: read-for-branching
  (docker/runtime/image, needed by validation step 3 and ARM branching)
  vs not-read-at-all (data-plane). Resolves the runtime contradiction.

- Open Question 1 flipped from "no detection" to warn-on-Deploy().
  Pseudocode + method-table row now describe the in-memory diff against
  on-disk Bicep.

- Eject UX now matches azd infra generate (cmd/infra_generate.go:204-210):
  interactive overwrite prompt, --no-prompt keeps hard-refuse for CI.
  Post-Eject CLI table and Accepted-trade paragraph updated.

- Destroy row spells out soft-delete purge of Cognitive Services accounts
  to mirror Core's bicep provider (bicep_provider.go:1283-1413). Without
  this, up -> down -> up under the same name fails.

- Schema relaxation now pattern: ^[a-z0-9.]+$ + examples, not examples
  alone. Keeps typo catching for all users.

- Brownfield section + Preview row: both Deploy and Preview now resolve
  endpoint -> ARM ID + target scope before invoking ARM. Preview can't
  run on a brownfield project without scope resolution.

- Telemetry section names docs/reference/telemetry-data.md as an
  implementation-PR deliverable per cli/azd/AGENTS.md:246-249.

- infra.layers[] escape hatch verified inline: InfraLayer.Provider field
  (provisioning/provider.go:57 -> :40) + ParseProvider accepts any string.

- Stability Contract tightened from "semantically identical" to
  "byte-stable within a patch extension version / byte-identical Bicep,"
  matching the Test Plan's byte-equal standard.

Test Plan picked up entries for each new behavior: schema pattern,
eject overwrite prompt, post-eject Deploy() drift warn, brownfield
Preview scope, and an expanded init -> provision -> down -> provision
E2E for soft-delete purge.

hund030 commented Jun 11, 2026

Collaborator Author

Thanks for the careful review — all 8 points hold up and have been applied to the spec:

  1. Synthesizer ignores table — split into "read for branching" (docker/runtime/image) and "not read at all" (everything else). runtime was the only contradiction; docker/image weren't actually in the table.
  2. Drift detection — Open Question 1 flipped to warn (don't block) on every on-disk Deploy(). Pseudocode and method-table row updated.
  3. Eject UX — now matches azd infra generate: interactive [y/N] prompt, --no-prompt keeps the hard-refuse for CI.
  4. Destroy soft-delete purge — row updated to mirror Core's getCognitiveAccountsToPurge + purgeItems (bicep_provider.go:1283-1413); up → down → up failure mode called out.
  5. Schema typo regression — pattern: ^[a-z0-9.]+$ + examples instead of examples alone.
  6. Brownfield + Preview scope — both Deploy and Preview now resolve endpoint: to ARM ID + scope before invoking ARM.
  7. Telemetry doc requirement — one-line callout naming docs/reference/telemetry-data.md per cli/azd/AGENTS.md:246-249.
  8. infra.layers[] escape hatch — verified inline: InfraLayer.Provider (provider.go:57 -> :40) + ParseProvider accepts any string.
  9. Stability vs Test Plan wording — tightened to byte-stable within a patch extension version, matching the test.

Test Plan also picked up entries for each new behavior. Pushing the commit shortly.

3 participants