Skip to content

feat: allow templating per-agent reasoning.effort and context_tier (#262)#263

Merged
jrob5756 merged 3 commits into
microsoft:mainfrom
franklixuefei:xuefl/templatable-reasoning-enums
Jun 24, 2026
Merged

feat: allow templating per-agent reasoning.effort and context_tier (#262)#263
jrob5756 merged 3 commits into
microsoft:mainfrom
franklixuefei:xuefl/templatable-reasoning-enums

Conversation

@franklixuefei

Copy link
Copy Markdown
Member

Summary

Allows per-agent reasoning.effort and context_tier to be templated with {{ workflow.input.* }} (and {% %} statement) Jinja2 expressions, resolved at runtime. Closes #262.

Both fields were strict pydantic Literal enums, so a templated value like effort: "{{ workflow.input.eff }}" was rejected at YAML load — before --input / Jinja2 templates are resolved. model: already escaped this because it is a free-form str resolved at runtime.

agents:
  - name: thinker
    reasoning:
      effort: "{{ workflow.input.effort }}"   # now accepted; resolved + validated at runtime
    context_tier: "{{ workflow.input.tier }}"

Approach: defer enum validation to runtime (issue Approach 2)

The issue marks Approach 1 ("resolve templates before enum validation") as preferred, but after reading the code Approach 2 is the idiomatic, lower-risk fit — and it mirrors a shipped, load-bearing precedent in the same file: the wait step's duration (templatable-but-validated via _validate_wait_duration + runtime render in executor/wait.py).

Why Approach 1 was rejected (grounded in the code):

  • Inputs aren't available at load. --input values + workflow.input defaults are merged later in the engine, not in config/loader.py.
  • conductor validate has no inputs at all, so a {{ workflow.input.x }} could never resolve during validate — you'd still need a "defer unresolved template" path.
  • Chicken-and-egg: workflow.input defaults live inside the same YAML that model_validate parses.
  • Scoping a pre-resolution to "only workflow.input, only these enum fields" converges right back to Approach 2's surgery anyway.

Changes

Schema (config/schema.py)

  • Widen ReasoningConfig.effortReasoningEffort | str + a @model_validator(mode="after") that defers {{/{% templates and validates literals via get_args.
  • Widen AgentDef.context_tierContextTier | str | None + _validate_context_tier() dispatched from the regular-agent branch of validate_agent_type. Non-agent step types still reject the field outright (a template string is still "not None").

Runtime (executor/agent.py)

  • After the existing model render, render + validate templated reasoning.effort / context_tier with the full agent context. Every execution path (main / retry / dialog / for_each) routes through AgentExecutor.execute, so this single chokepoint covers them all. A _render_enum_field helper renders, strips trailing whitespace, validates the resolved literal, and raises a clear ValidationError on a bad resolution. The provider then sees a concrete enum — no provider changes.

Providers (providers/reasoning.py, providers/context_tier.py)

  • resolve_* cast(...) the widened field back to the strict enum — by the time they run (provider execute, after the executor resolves the field) the value is concrete and validated.

Validator (config/validator.py)

  • The reasoning.effort capability cross-check defers only the value-dependent membership check for templates. The value-independent "provider supports no reasoning at all" check (reasoning_effort=None) still fires for templated efforts — otherwise a templated effort on a no-reasoning provider would pass validate while a literal errors.

Rigor

Pre-impl + dual-model post-impl rubber-duck (Claude Opus 4.8 + GPT-5.5, run in parallel). Both independently flagged the validator capability gap above (fixed). Opus additionally caught a schema/executor asymmetry — the schema deferred only on {{ while the executor renders {{ or {%, so a {% if %} template was rejected at load though the executor would render it; both schema validators now mirror the executor guard. Verified end-to-end: {% if workflow.input.heavy %}xhigh{% else %}low{% endif %} resolves to xhigh/low per input.

Tests

  • Schema: templated effort/context_tier deferred (incl. {% %} statements); literal-invalid still rejected; human_gate/script/workflow/etc still reject templated fields.
  • Runtime: render for both fields, static pass-through, no-mutation of the original templated agent, bad-resolution raises, trailing-newline strip.
  • Validator: templated effort defers the membership check; templated effort on a no-reasoning provider still errors.
  • CLI: conductor validate accepts a templated-enum workflow (exit 0), new fixture valid_templated_enums.yaml.

Verification

  • ruff check + ruff format --check: clean
  • ty check src: 12 diagnostics, == baseline (0 introduced; all pre-existing Windows termios / dialog stubs)
  • pytest: 3526 passed; the 13 failures are identical on a clean upstream/main checkout (pre-existing Windows-env: registry TOML, event_log non-serializable, script stdin tty) — 0 new failures.

Out of scope (deliberate non-goals)

  • Workflow-level runtime.default_reasoning_effort / default_context_tier templating — the AC is per-agent only; these remain strict enums (a templated workflow-level default is rejected with a clear message). Can follow up separately.
  • A generalized Annotated[...] "templatable enum" type — starting with | str + the duration-style validator to match existing code; generalize only if a third enum field needs it.

A minor cosmetic note: the workflow_started telemetry event reports the raw (unrendered) per-agent template, since per-for_each-item resolutions differ. This dict is telemetry-only (never sent to a provider), so there's no correctness impact — flagging it as a conscious choice.

…icrosoft#262)

Per-agent `reasoning.effort` (low/medium/high/xhigh) and `context_tier`
(default/long_context) were strict pydantic `Literal` enums, so a templated
value like `effort: "{{ workflow.input.eff }}"` was rejected at YAML load —
before `--input` / Jinja2 templates are resolved. `model:` already escaped
this because it is a free-form `str` resolved at runtime.

This implements the issue's Approach 2 (defer enum validation to runtime),
mirroring the existing, shipped precedent for the `wait` step's `duration`
field (templatable-but-validated). Approach 1 (resolve templates before
enum validation in the loader) was evaluated and rejected: the loader has
no `--input`/`workflow.input` defaults available, `conductor validate` has
no inputs at all, and pre-resolving collapses back into Approach 2's
surgery anyway. See PR description for the full grounded comparison.

## Changes

Schema (src/conductor/config/schema.py)
- Widen `ReasoningConfig.effort` to `ReasoningEffort | str` and add a
  `@model_validator(mode="after")` that defers `{{`/`{%` templates and
  validates literals against `get_args(ReasoningEffort)`.
- Widen `AgentDef.context_tier` to `ContextTier | str | None` and add
  `_validate_context_tier()` dispatched from the regular-agent branch of
  `validate_agent_type` (non-agent types still reject the field outright —
  a template string is still "not None").
- Docstrings advertise template support, mirroring the `model` docstring.

Runtime (src/conductor/executor/agent.py)
- After the existing `model` render, render + validate templated
  `reasoning.effort` / `context_tier` with the full agent context (covers
  main / retry / dialog / for_each — every path routes through
  `AgentExecutor.execute`). A `_render_enum_field` helper renders, strips
  trailing whitespace (the renderer keeps trailing newlines), and validates
  the resolved literal, raising a clear `ValidationError` on a bad
  resolution. The provider then sees a concrete enum — no provider changes.

Providers (reasoning.py / context_tier.py)
- `resolve_*` now `cast(...)` the widened field back to the strict enum:
  by the time these run (provider execute, after the executor resolves the
  field) the value is a concrete, validated literal.

Validator (src/conductor/config/validator.py)
- The reasoning.effort capability cross-check now skips only the
  value-DEPENDENT membership check for templates. The value-INDEPENDENT
  "provider supports no reasoning at all" check (`reasoning_effort=None`)
  still fires for templated efforts — otherwise a templated effort on a
  no-reasoning provider (e.g. claude-agent-sdk) would pass `validate` while
  a literal errors, silently dropping operator intent.

## Per pre-impl + DUAL-model post-impl rubber-duck

Pre-impl RD: confirmed the `ty`-type-safety angle (the `| str` widening
propagates into the `resolve_*` return types) and resolved it with honest
`cast()`s guarded by the runtime-resolution guarantee.

Post-impl RD ran TWO models in parallel (Claude Opus 4.8 + GPT-5.5) and
both independently flagged the validator gap above (GPT P1 / Opus P2) —
fixed by reordering the capability check. Opus additionally caught a
schema/executor asymmetry: the schema deferred only on `{{` while the
executor renders `{{` or `{%`, so a statement-style `{% if %}` template was
rejected at load though the executor would render it. Both schema validators
now mirror the executor guard, enabling conditional templates end-to-end
(verified: `{% if workflow.input.heavy %}xhigh{% else %}low{% endif %}`
resolves to xhigh/low per input). Opus verified 8/10 risk areas fully clean
(no unrendered template reaches any SDK; dialog/for_each/retry/claude-budget
all route through the resolved agent copy).

## Tests

- tests/test_config/test_schema.py: templated effort/context_tier deferred
  (incl. `{% %}` statement templates), literal-invalid still rejected,
  human_gate/script/workflow/etc still reject templated fields.
- tests/test_executor/test_agent.py: runtime render for both fields, static
  pass-through, no-mutation of the original templated agent, bad-resolution
  raises, trailing-newline strip.
- tests/test_config/test_validator_capabilities.py: templated effort defers
  the membership check; templated effort on a no-reasoning provider still
  errors.
- tests/test_cli/test_validate.py + tests/fixtures/valid_templated_enums.yaml:
  `conductor validate` accepts a templated-enum workflow (exit 0).

## Verification

- ruff check + format: clean
- ty check src: 12 diagnostics (== baseline; 0 introduced — all pre-existing
  Windows termios / dialog stubs)
- pytest: 3526 passed; the 13 failures are identical on a clean
  upstream/main checkout (pre-existing Windows-env: registry TOML, event_log
  non-serializable, script stdin tty) — 0 new failures.

## Out of scope (deliberate non-goals)

- Workflow-level `runtime.default_reasoning_effort` / `default_context_tier`
  templating — the issue's acceptance criteria are per-agent only; these
  remain strict enums (a templated workflow-level default is rejected with a
  clear message). Can follow up separately.
- A generalized `Annotated[...]` "templatable enum" type — starting with
  `| str` + the duration-style validator to match existing code; generalize
  only if a third enum field needs it.

Closes microsoft#262
@codecov-commenter

codecov-commenter commented Jun 19, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@5d769eb). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #263   +/-   ##
=======================================
  Coverage        ?   86.41%           
=======================================
  Files           ?       69           
  Lines           ?    12149           
  Branches        ?        0           
=======================================
  Hits            ?    10498           
  Misses          ?     1651           
  Partials        ?        0           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…-check (microsoft#262 pr-review R1-001)

Multi-agent pr-review (claude-opus-4.8 + gpt-5.5, full consensus, 91/100,
0 disputed) caught a self-consistency gap in the validator capability
cross-check added by this PR.

The schema (`_validate_effort` / `_validate_context_tier`), the runtime
executor (`_render_enum_field` guard), and the `context_tier` validator all
defer on `{{` OR `{%`, but the `reasoning.effort` capability cross-check in
`config/validator.py` deferred the value-dependent membership check only for
`{{`-style templates. So a `{% if %}...{% endif %}` effort on a
reasoning-capable provider (non-None `reasoning_effort` tuple, e.g. copilot)
fell through to `requested not in supported` and made
`validate_workflow_config` raise — at odds with this PR's own statement-style
support (and the microsoft#262 acceptance criterion that `conductor validate` must not
reject a templated enum field). Scope was `conductor validate` only; the
runtime path was already correct.

Fix: mirror the shared `{{`-or-`{%` predicate used everywhere else.

Added a validator-capabilities regression test feeding a
`{% if %}...{% endif %}` effort through `validate_workflow_config` against a
provider with a restricted non-None `reasoning_effort` tuple, asserting it
does not raise (the existing test only covered `{{`).

Verification: ruff + ty clean (ty == 12 baseline); 3527 passed (the 13
failures are identical on clean upstream/main — pre-existing Windows env).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@franklixuefei

Copy link
Copy Markdown
Member Author

@jrob5756 kindly requesting your review on this, thanks Jason!

1 similar comment
@franklixuefei

Copy link
Copy Markdown
Member Author

@jrob5756 kindly requesting your review on this, thanks Jason!

@jrob5756 jrob5756 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran a full pass over this: code quality, tests, types, comments, and error handling. The implementation is correct. The widen-at-schema, defer, render-in-executor, cast-in-resolver chain holds on every path I traced, and the new tests pass. No blocking issues.

A few non-blocking things, in rough priority:

Harden the two resolver casts (see the inline notes on reasoning.py and context_tier.py). The cast is sound today only because the executor always renders first, and the one direct provider.execute caller (the output validator) builds a synthetic agent without these fields. Nothing enforces that, so a guard is cheap insurance against a future caller.

The ("{{" in v or "{%" in v) check is now duplicated across five sites: both schema validators, the two executor blocks, and the capability cross-check. A shared is_jinja_template() helper keeps them from drifting. One catch worth flagging: it has to return TypeGuard[str], not bool, or the executor loses the narrowing it depends on and ty fails. A leaf module such as conductor/templating.py avoids an import cycle. Leave _validate_wait_duration as-is; its {{-only check is deliberate.

Two test gaps. Nothing renders a {% %} statement template through the executor, which is the only layer that actually renders it; if that guard regressed, the schema and validator tests would still pass. Separately, an empty resolution (a {% if %} with no else) raises instead of falling back to the runtime default, which is a real behavior choice worth pinning with a test.

Pre-existing, not introduced here: a context_tier set on a non-Copilot provider is silently dropped, with no capability check and no warning, unlike reasoning effort which is rejected at validate time. Templating extends that silent drop to the templated path. Probably a follow-up.

Comment thread src/conductor/providers/reasoning.py Outdated
Comment thread src/conductor/providers/context_tier.py Outdated
Comment thread src/conductor/executor/agent.py Outdated
Comment thread src/conductor/config/schema.py Outdated
Comment thread src/conductor/config/schema.py Outdated
@franklixuefei

franklixuefei commented Jun 24, 2026

Copy link
Copy Markdown
Member Author

I ran a full pass over this: code quality, tests, types, comments, and error handling. The implementation is correct. The widen-at-schema, defer, render-in-executor, cast-in-resolver chain holds on every path I traced, and the new tests pass. No blocking issues.

A few non-blocking things, in rough priority:

Harden the two resolver casts (see the inline notes on reasoning.py and context_tier.py). The cast is sound today only because the executor always renders first, and the one direct provider.execute caller (the output validator) builds a synthetic agent without these fields. Nothing enforces that, so a guard is cheap insurance against a future caller.

The ("{{" in v or "{%" in v) check is now duplicated across five sites: both schema validators, the two executor blocks, and the capability cross-check. A shared is_jinja_template() helper keeps them from drifting. One catch worth flagging: it has to return TypeGuard[str], not bool, or the executor loses the narrowing it depends on and ty fails. A leaf module such as conductor/templating.py avoids an import cycle. Leave _validate_wait_duration as-is; its {{-only check is deliberate.

Two test gaps. Nothing renders a {% %} statement template through the executor, which is the only layer that actually renders it; if that guard regressed, the schema and validator tests would still pass. Separately, an empty resolution (a {% if %} with no else) raises instead of falling back to the runtime default, which is a real behavior choice worth pinning with a test.

Pre-existing, not introduced here: a context_tier set on a non-Copilot provider is silently dropped, with no capability check and no warning, unlike reasoning effort which is rejected at validate time. Templating extends that silent drop to the templated path. Probably a follow-up.

Thanks @jrob5756! I addressed all comments.

re: shared is_jinja_template() helper (TypeGuard[str], leaf module)

Done (a4c698c) — added conductor/templating.py with is_jinja_template(value) -> TypeGuard[str], a leaf module with no intra-conductor imports so there's no cycle. Routed the duplicated sites through it: both schema validators, the two executor blocks, and the capability cross-check — plus the model render block (same predicate, same function, so leaving it inline would invite exactly the drift you're guarding against) and the two new resolver guards. _validate_wait_duration stays as-is per your note (its {{-only check is deliberate), as does router.py's {{+}} when check (a different predicate). Confirmed it returns TypeGuard[str], not bool — ty passes and the executor keeps its narrowing into _render_enum_field. Added a small unit test for the helper itself (detects {{ and {%, rejects plain strings and non-strings — None in particular, so the model guard could drop its old agent.model and … short-circuit).

re: test gap ({% %} statement through the executor)

Added (a4c698c) — test_statement_template_effort_is_rendered drives a {% if … %}…{% else %}…{% endif %} statement template through the executor (the only layer that renders), plus a matching one for context_tier. As you said, the schema/validator tests would have stayed green if that guard regressed, so these pin the actual render path.

re: review summary: empty-resolution behavior choice

Pinned (a4c698c) — test_effort_resolving_to_empty_raises covers a {% if … %}…{% endif %} with no matching branch (resolves to "").

On the behavior itself, I made a deliberate call to keep it fail-closed (raise) rather than fall back to the runtime default, and gave the empty case its own actionable message ("…resolved to an empty value" + a hint to add an else-branch or omit the field). The reasoning: it's consistent with the surrounding contracts — _render_enum_field already rejects any value not in allowed, and the new B1/B2 resolver guards are fail-closed too — so treating empty as a special "unset → default" would add a third, silent meaning for empty that could mask a genuinely broken conditional. It also keeps this a pure pin rather than a behavior change with a type wrinkle: ReasoningConfig.effort is ReasoningEffort | str (not None-able), so "fall back to default" would mean making the override itself droppable, not just resolving it.

If you'd rather it fall back to the runtime default (treat an empty conditional as "no override"), I'm happy to flip it — I mainly wanted to pin the current behavior and surface the choice explicitly, as you suggested.

re: context_tier silently dropped on non-Copilot providers (pre-existing)

Agreed this is worth a follow-up, and that it's pre-existing — templating just extends the existing silent drop to the templated path. It's a real asymmetry vs reasoning effort (which is rejected at validate time via the capability cross-check). I'd keep it out of this PR to keep the scope tight; happy to open an issue to add a capability-style check (or at least a warning) for context_tier on providers that don't consume it. Let me know if you'd like me to file that.

@jrob5756

Copy link
Copy Markdown
Collaborator

Thanks for turning these around fast. One snag: commit a4c698c that all five replies point to isn't on the branch.

The head is still f5d5ee52 from June 19 (2 commits, both pre-review), and when I pull the live files none of the changes are there:

  • reasoning.py / context_tier.py resolvers still do a bare cast with no guard
  • the shared is_jinja_template() helper and templating.py module 404
  • the agent.py comment still says Enum | str | None
  • the schema docstrings still claim they mirror _validate_wait_duration

Looks like the push landed on a different branch, or got dropped. Could you push a4c698c to xuefl/templatable-reasoning-enums? Once the head moves I'll re-verify. None of it's blocking; I just want the branch history to match the replies.

@franklixuefei

Copy link
Copy Markdown
Member Author

Thanks for turning these around fast. One snag: commit a4c698c that all five replies point to isn't on the branch.

The head is still f5d5ee52 from June 19 (2 commits, both pre-review), and when I pull the live files none of the changes are there:

  • reasoning.py / context_tier.py resolvers still do a bare cast with no guard
  • the shared is_jinja_template() helper and templating.py module 404
  • the agent.py comment still says Enum | str | None
  • the schema docstrings still claim they mirror _validate_wait_duration

Looks like the push landed on a different branch, or got dropped. Could you push a4c698c to xuefl/templatable-reasoning-enums? Once the head moves I'll re-verify. None of it's blocking; I just want the branch history to match the replies.

Ah sorry about that... now my local commit is pushed, please check again. Thanks Jason!

… + resolver guards

Jason's review feedback on PR microsoft#263:

- B6: extract the {{ }}/{% %} predicate into conductor/templating.py
  (is_jinja_template, a TypeGuard[str] leaf module) and route all six
  enum-template sites + the two new guards through it. _validate_wait_duration
  ({{-only) and router.py ({{+}}) are intentionally left.
- B1/B2: guard resolve_reasoning_effort / resolve_context_tier so an unrendered
  template raises instead of being cast straight to the SDK (context_tier is
  forwarded unvalidated — the riskier path). Fix the "AgentDef widens effort"
  comment (effort is on ReasoningConfig).
- B3: agent.py comment — ReasoningEffort/ContextTier are Literal aliases, not
  Enums (hence get_args).
- B4/B5: schema validator docstrings no longer claim to "mirror
  _validate_wait_duration" (that method is {{-only and never defers None);
  point at the executor's model-render path instead.
- B7: executor test for a {% %} statement template (only {{ }} was covered).
- B8: pin the empty-resolution behavior — a conditional with no matching branch
  resolves to "" and fails closed with an actionable message (consistent with
  _render_enum_field and the new guards) rather than silently using the default.

ruff + ty clean; targeted suites green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@franklixuefei franklixuefei force-pushed the xuefl/templatable-reasoning-enums branch from a4c698c to 7d42fb0 Compare June 24, 2026 23:21
@franklixuefei franklixuefei requested a review from jrob5756 June 24, 2026 23:22

@jrob5756 jrob5756 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Re-verified at 7d42fb06: all eight points are in, the shared is_jinja_template helper is wired through every site, both resolver guards are there, and the docstrings/comments read correctly now. Lint and format clean, ty is at baseline (the lone dialog_evaluator diagnostic is pre-existing), and the affected suites are green (357 passed). Thanks for the thorough turnaround.

@jrob5756 jrob5756 merged commit ff02c8d into microsoft:main Jun 24, 2026
9 checks passed
jrob5756 added a commit that referenced this pull request Jun 27, 2026
Bump version 0.1.19 -> 0.1.20 and finalize the changelog.

Changelog (0.1.20):
- Added: Hermes provider (#235), cost budget enforcement (#212),
  external-workflow-friction knobs — output_mode / max_parse_recovery_attempts /
  gate-respond CLI / Windows paths (#234), templated reasoning.effort &
  context_tier (#263), scoped applyTo instruction loading (#238).
- Fixed: Copilot model attribution for auto-routed runs (#268),
  claude-agent-sdk default tool preset when tools: omitted (#269).

Re-locked uv.lock to record 0.1.20. Quality gates green locally
(ruff, ty, pytest excl. real_api/performance: 3673 passed).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow templating of per-agent context_tier / reasoning.effort (resolve {{ workflow.input.* }} before enum validation)

4 participants