feat: allow templating per-agent reasoning.effort and context_tier (#262)#263
Conversation
…icrosoft#262) Per-agent `reasoning.effort` (low/medium/high/xhigh) and `context_tier` (default/long_context) were strict pydantic `Literal` enums, so a templated value like `effort: "{{ workflow.input.eff }}"` was rejected at YAML load — before `--input` / Jinja2 templates are resolved. `model:` already escaped this because it is a free-form `str` resolved at runtime. This implements the issue's Approach 2 (defer enum validation to runtime), mirroring the existing, shipped precedent for the `wait` step's `duration` field (templatable-but-validated). Approach 1 (resolve templates before enum validation in the loader) was evaluated and rejected: the loader has no `--input`/`workflow.input` defaults available, `conductor validate` has no inputs at all, and pre-resolving collapses back into Approach 2's surgery anyway. See PR description for the full grounded comparison. ## Changes Schema (src/conductor/config/schema.py) - Widen `ReasoningConfig.effort` to `ReasoningEffort | str` and add a `@model_validator(mode="after")` that defers `{{`/`{%` templates and validates literals against `get_args(ReasoningEffort)`. - Widen `AgentDef.context_tier` to `ContextTier | str | None` and add `_validate_context_tier()` dispatched from the regular-agent branch of `validate_agent_type` (non-agent types still reject the field outright — a template string is still "not None"). - Docstrings advertise template support, mirroring the `model` docstring. Runtime (src/conductor/executor/agent.py) - After the existing `model` render, render + validate templated `reasoning.effort` / `context_tier` with the full agent context (covers main / retry / dialog / for_each — every path routes through `AgentExecutor.execute`). A `_render_enum_field` helper renders, strips trailing whitespace (the renderer keeps trailing newlines), and validates the resolved literal, raising a clear `ValidationError` on a bad resolution. The provider then sees a concrete enum — no provider changes. Providers (reasoning.py / context_tier.py) - `resolve_*` now `cast(...)` the widened field back to the strict enum: by the time these run (provider execute, after the executor resolves the field) the value is a concrete, validated literal. Validator (src/conductor/config/validator.py) - The reasoning.effort capability cross-check now skips only the value-DEPENDENT membership check for templates. The value-INDEPENDENT "provider supports no reasoning at all" check (`reasoning_effort=None`) still fires for templated efforts — otherwise a templated effort on a no-reasoning provider (e.g. claude-agent-sdk) would pass `validate` while a literal errors, silently dropping operator intent. ## Per pre-impl + DUAL-model post-impl rubber-duck Pre-impl RD: confirmed the `ty`-type-safety angle (the `| str` widening propagates into the `resolve_*` return types) and resolved it with honest `cast()`s guarded by the runtime-resolution guarantee. Post-impl RD ran TWO models in parallel (Claude Opus 4.8 + GPT-5.5) and both independently flagged the validator gap above (GPT P1 / Opus P2) — fixed by reordering the capability check. Opus additionally caught a schema/executor asymmetry: the schema deferred only on `{{` while the executor renders `{{` or `{%`, so a statement-style `{% if %}` template was rejected at load though the executor would render it. Both schema validators now mirror the executor guard, enabling conditional templates end-to-end (verified: `{% if workflow.input.heavy %}xhigh{% else %}low{% endif %}` resolves to xhigh/low per input). Opus verified 8/10 risk areas fully clean (no unrendered template reaches any SDK; dialog/for_each/retry/claude-budget all route through the resolved agent copy). ## Tests - tests/test_config/test_schema.py: templated effort/context_tier deferred (incl. `{% %}` statement templates), literal-invalid still rejected, human_gate/script/workflow/etc still reject templated fields. - tests/test_executor/test_agent.py: runtime render for both fields, static pass-through, no-mutation of the original templated agent, bad-resolution raises, trailing-newline strip. - tests/test_config/test_validator_capabilities.py: templated effort defers the membership check; templated effort on a no-reasoning provider still errors. - tests/test_cli/test_validate.py + tests/fixtures/valid_templated_enums.yaml: `conductor validate` accepts a templated-enum workflow (exit 0). ## Verification - ruff check + format: clean - ty check src: 12 diagnostics (== baseline; 0 introduced — all pre-existing Windows termios / dialog stubs) - pytest: 3526 passed; the 13 failures are identical on a clean upstream/main checkout (pre-existing Windows-env: registry TOML, event_log non-serializable, script stdin tty) — 0 new failures. ## Out of scope (deliberate non-goals) - Workflow-level `runtime.default_reasoning_effort` / `default_context_tier` templating — the issue's acceptance criteria are per-agent only; these remain strict enums (a templated workflow-level default is rejected with a clear message). Can follow up separately. - A generalized `Annotated[...]` "templatable enum" type — starting with `| str` + the duration-style validator to match existing code; generalize only if a third enum field needs it. Closes microsoft#262
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #263 +/- ##
=======================================
Coverage ? 86.41%
=======================================
Files ? 69
Lines ? 12149
Branches ? 0
=======================================
Hits ? 10498
Misses ? 1651
Partials ? 0 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…-check (microsoft#262 pr-review R1-001) Multi-agent pr-review (claude-opus-4.8 + gpt-5.5, full consensus, 91/100, 0 disputed) caught a self-consistency gap in the validator capability cross-check added by this PR. The schema (`_validate_effort` / `_validate_context_tier`), the runtime executor (`_render_enum_field` guard), and the `context_tier` validator all defer on `{{` OR `{%`, but the `reasoning.effort` capability cross-check in `config/validator.py` deferred the value-dependent membership check only for `{{`-style templates. So a `{% if %}...{% endif %}` effort on a reasoning-capable provider (non-None `reasoning_effort` tuple, e.g. copilot) fell through to `requested not in supported` and made `validate_workflow_config` raise — at odds with this PR's own statement-style support (and the microsoft#262 acceptance criterion that `conductor validate` must not reject a templated enum field). Scope was `conductor validate` only; the runtime path was already correct. Fix: mirror the shared `{{`-or-`{%` predicate used everywhere else. Added a validator-capabilities regression test feeding a `{% if %}...{% endif %}` effort through `validate_workflow_config` against a provider with a restricted non-None `reasoning_effort` tuple, asserting it does not raise (the existing test only covered `{{`). Verification: ruff + ty clean (ty == 12 baseline); 3527 passed (the 13 failures are identical on clean upstream/main — pre-existing Windows env). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@jrob5756 kindly requesting your review on this, thanks Jason! |
1 similar comment
|
@jrob5756 kindly requesting your review on this, thanks Jason! |
jrob5756
left a comment
There was a problem hiding this comment.
I ran a full pass over this: code quality, tests, types, comments, and error handling. The implementation is correct. The widen-at-schema, defer, render-in-executor, cast-in-resolver chain holds on every path I traced, and the new tests pass. No blocking issues.
A few non-blocking things, in rough priority:
Harden the two resolver casts (see the inline notes on reasoning.py and context_tier.py). The cast is sound today only because the executor always renders first, and the one direct provider.execute caller (the output validator) builds a synthetic agent without these fields. Nothing enforces that, so a guard is cheap insurance against a future caller.
The ("{{" in v or "{%" in v) check is now duplicated across five sites: both schema validators, the two executor blocks, and the capability cross-check. A shared is_jinja_template() helper keeps them from drifting. One catch worth flagging: it has to return TypeGuard[str], not bool, or the executor loses the narrowing it depends on and ty fails. A leaf module such as conductor/templating.py avoids an import cycle. Leave _validate_wait_duration as-is; its {{-only check is deliberate.
Two test gaps. Nothing renders a {% %} statement template through the executor, which is the only layer that actually renders it; if that guard regressed, the schema and validator tests would still pass. Separately, an empty resolution (a {% if %} with no else) raises instead of falling back to the runtime default, which is a real behavior choice worth pinning with a test.
Pre-existing, not introduced here: a context_tier set on a non-Copilot provider is silently dropped, with no capability check and no warning, unlike reasoning effort which is rejected at validate time. Templating extends that silent drop to the templated path. Probably a follow-up.
Thanks @jrob5756! I addressed all comments. re: shared Done ( re: test gap ( Added ( re: review summary: empty-resolution behavior choice Pinned ( On the behavior itself, I made a deliberate call to keep it fail-closed (raise) rather than fall back to the runtime default, and gave the empty case its own actionable message ("…resolved to an empty value" + a hint to add an else-branch or omit the field). The reasoning: it's consistent with the surrounding contracts — If you'd rather it fall back to the runtime default (treat an empty conditional as "no override"), I'm happy to flip it — I mainly wanted to pin the current behavior and surface the choice explicitly, as you suggested. re: Agreed this is worth a follow-up, and that it's pre-existing — templating just extends the existing silent drop to the templated path. It's a real asymmetry vs reasoning effort (which is rejected at validate time via the capability cross-check). I'd keep it out of this PR to keep the scope tight; happy to open an issue to add a capability-style check (or at least a warning) for |
|
Thanks for turning these around fast. One snag: commit The head is still
Looks like the push landed on a different branch, or got dropped. Could you push |
Ah sorry about that... now my local commit is pushed, please check again. Thanks Jason! |
… + resolver guards Jason's review feedback on PR microsoft#263: - B6: extract the {{ }}/{% %} predicate into conductor/templating.py (is_jinja_template, a TypeGuard[str] leaf module) and route all six enum-template sites + the two new guards through it. _validate_wait_duration ({{-only) and router.py ({{+}}) are intentionally left. - B1/B2: guard resolve_reasoning_effort / resolve_context_tier so an unrendered template raises instead of being cast straight to the SDK (context_tier is forwarded unvalidated — the riskier path). Fix the "AgentDef widens effort" comment (effort is on ReasoningConfig). - B3: agent.py comment — ReasoningEffort/ContextTier are Literal aliases, not Enums (hence get_args). - B4/B5: schema validator docstrings no longer claim to "mirror _validate_wait_duration" (that method is {{-only and never defers None); point at the executor's model-render path instead. - B7: executor test for a {% %} statement template (only {{ }} was covered). - B8: pin the empty-resolution behavior — a conditional with no matching branch resolves to "" and fails closed with an actionable message (consistent with _render_enum_field and the new guards) rather than silently using the default. ruff + ty clean; targeted suites green. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
a4c698c to
7d42fb0
Compare
jrob5756
left a comment
There was a problem hiding this comment.
LGTM. Re-verified at 7d42fb06: all eight points are in, the shared is_jinja_template helper is wired through every site, both resolver guards are there, and the docstrings/comments read correctly now. Lint and format clean, ty is at baseline (the lone dialog_evaluator diagnostic is pre-existing), and the affected suites are green (357 passed). Thanks for the thorough turnaround.
Bump version 0.1.19 -> 0.1.20 and finalize the changelog. Changelog (0.1.20): - Added: Hermes provider (#235), cost budget enforcement (#212), external-workflow-friction knobs — output_mode / max_parse_recovery_attempts / gate-respond CLI / Windows paths (#234), templated reasoning.effort & context_tier (#263), scoped applyTo instruction loading (#238). - Fixed: Copilot model attribution for auto-routed runs (#268), claude-agent-sdk default tool preset when tools: omitted (#269). Re-locked uv.lock to record 0.1.20. Quality gates green locally (ruff, ty, pytest excl. real_api/performance: 3673 passed). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Allows per-agent
reasoning.effortandcontext_tierto be templated with{{ workflow.input.* }}(and{% %}statement) Jinja2 expressions, resolved at runtime. Closes #262.Both fields were strict pydantic
Literalenums, so a templated value likeeffort: "{{ workflow.input.eff }}"was rejected at YAML load — before--input/ Jinja2 templates are resolved.model:already escaped this because it is a free-formstrresolved at runtime.Approach: defer enum validation to runtime (issue Approach 2)
The issue marks Approach 1 ("resolve templates before enum validation") as preferred, but after reading the code Approach 2 is the idiomatic, lower-risk fit — and it mirrors a shipped, load-bearing precedent in the same file: the
waitstep'sduration(templatable-but-validated via_validate_wait_duration+ runtime render inexecutor/wait.py).Why Approach 1 was rejected (grounded in the code):
--inputvalues +workflow.inputdefaults are merged later in the engine, not inconfig/loader.py.conductor validatehas no inputs at all, so a{{ workflow.input.x }}could never resolve during validate — you'd still need a "defer unresolved template" path.workflow.inputdefaults live inside the same YAML thatmodel_validateparses.workflow.input, only these enum fields" converges right back to Approach 2's surgery anyway.Changes
Schema (
config/schema.py)ReasoningConfig.effort→ReasoningEffort | str+ a@model_validator(mode="after")that defers{{/{%templates and validates literals viaget_args.AgentDef.context_tier→ContextTier | str | None+_validate_context_tier()dispatched from the regular-agent branch ofvalidate_agent_type. Non-agent step types still reject the field outright (a template string is still "not None").Runtime (
executor/agent.py)modelrender, render + validate templatedreasoning.effort/context_tierwith the full agent context. Every execution path (main / retry / dialog / for_each) routes throughAgentExecutor.execute, so this single chokepoint covers them all. A_render_enum_fieldhelper renders, strips trailing whitespace, validates the resolved literal, and raises a clearValidationErroron a bad resolution. The provider then sees a concrete enum — no provider changes.Providers (
providers/reasoning.py,providers/context_tier.py)resolve_*cast(...)the widened field back to the strict enum — by the time they run (provider execute, after the executor resolves the field) the value is concrete and validated.Validator (
config/validator.py)reasoning_effort=None) still fires for templated efforts — otherwise a templated effort on a no-reasoning provider would passvalidatewhile a literal errors.Rigor
Pre-impl + dual-model post-impl rubber-duck (Claude Opus 4.8 + GPT-5.5, run in parallel). Both independently flagged the validator capability gap above (fixed). Opus additionally caught a schema/executor asymmetry — the schema deferred only on
{{while the executor renders{{or{%, so a{% if %}template was rejected at load though the executor would render it; both schema validators now mirror the executor guard. Verified end-to-end:{% if workflow.input.heavy %}xhigh{% else %}low{% endif %}resolves toxhigh/lowper input.Tests
{% %}statements); literal-invalid still rejected; human_gate/script/workflow/etc still reject templated fields.conductor validateaccepts a templated-enum workflow (exit 0), new fixturevalid_templated_enums.yaml.Verification
ruff check+ruff format --check: cleanty check src: 12 diagnostics, == baseline (0 introduced; all pre-existing Windows termios / dialog stubs)pytest: 3526 passed; the 13 failures are identical on a cleanupstream/maincheckout (pre-existing Windows-env: registry TOML, event_log non-serializable, script stdin tty) — 0 new failures.Out of scope (deliberate non-goals)
runtime.default_reasoning_effort/default_context_tiertemplating — the AC is per-agent only; these remain strict enums (a templated workflow-level default is rejected with a clear message). Can follow up separately.Annotated[...]"templatable enum" type — starting with| str+ the duration-style validator to match existing code; generalize only if a third enum field needs it.A minor cosmetic note: the
workflow_startedtelemetry event reports the raw (unrendered) per-agent template, since per-for_each-item resolutions differ. This dict is telemetry-only (never sent to a provider), so there's no correctness impact — flagging it as a conscious choice.