feat: Phase 1 of typed on_error routing (#227) #229
…ting
Phase 1, step 1 of on_error routing (per design brief in
docs/projects/error-routing/on-error-routing.brainstorm.md).
Adds two opt-in schema fields and a small shared module for the
constants both schema validation and the engine error path will use:
- src/conductor/error_kinds.py
  - KIND_PATTERN: dotted lowercase identifier (at least one dot).
  - RESERVED_KIND_PREFIXES: internal., provider., subworkflow., retry.
    (the runtime owns these namespaces; workflow authors cannot declare
    kinds under them).
  - RESERVED_ON_ERROR_ALLOWLIST: the closed set of runtime-synthesized
    kinds that ARE legal to match in on_error even though they're not
    legal to declare in raises (internal.script_error,
    internal.schema_violation, internal.undeclared_kind).
  - is_reserved_prefix(kind) helper.
- RouteDef.on_error: bool | str | list[str] | None
  - None = success route (existing behavior).
  - True = catch-all error route.
  - str = single-kind error route.
  - list[str] = multi-kind error route.
  - False is rejected (no semantic meaning).
  - Kind format enforced via KIND_PATTERN.
  - 'before'-mode validator so Pydantic's bool/str coercion doesn't
    swallow the discriminator.
- AgentDef.raises: list[str] | None
  - Optional declaration of kinds the node may raise.
  - Powers a load-time lint (cross-checked against routes' on_error in
    the validator, landing in a follow-up commit) and a runtime
    undeclared-kind check (will land with the engine-wiring commit).
  - Reserved prefixes rejected so authors can't claim runtime
    namespaces; duplicates rejected; format enforced via KIND_PATTERN.
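A minimal sketch of the constants and field checks listed above. The regex is an assumption (the commit message only says "dotted lowercase identifier, at least one dot"), and `normalize_on_error` / `validate_raises` are hypothetical stand-ins for the Pydantic validators, not the code in `src/conductor/`.

```python
import re

# Assumed regex: lowercase segments joined by at least one dot.
# The real KIND_PATTERN may accept a different character set.
KIND_PATTERN = re.compile(r"^[a-z][a-z0-9_]*(?:\.[a-z][a-z0-9_]*)+$")

RESERVED_KIND_PREFIXES = ("internal.", "provider.", "subworkflow.", "retry.")


def is_reserved_prefix(kind: str) -> bool:
    """True when the kind lives in a runtime-owned namespace."""
    return kind.startswith(RESERVED_KIND_PREFIXES)


def normalize_on_error(value):
    """Hypothetical 'before'-mode check for RouteDef.on_error."""
    if value is None or value is True:
        return value  # success route / catch-all error route
    if value is False:
        raise ValueError("on_error: false has no meaning; omit the field")
    kinds = [value] if isinstance(value, str) else value
    if not isinstance(kinds, list):
        raise ValueError("on_error must be true, a kind, or a list of kinds")
    for kind in kinds:
        if not isinstance(kind, str) or not KIND_PATTERN.match(kind):
            raise ValueError(f"invalid error kind: {kind!r}")
    return value


def validate_raises(kinds):
    """Hypothetical check for AgentDef.raises."""
    if kinds is None:
        return None
    seen = set()
    for kind in kinds:
        if not KIND_PATTERN.match(kind):
            raise ValueError(f"invalid error kind: {kind!r}")
        if is_reserved_prefix(kind):
            raise ValueError(f"reserved prefix in raises: {kind!r}")
        if kind in seen:
            raise ValueError(f"duplicate kind in raises: {kind!r}")
        seen.add(kind)
    return kinds
```

Checking `value is True` / `value is False` before the string branch is what keeps the boolean discriminator from being coerced away.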
Tests:
- tests/test_error_kinds.py — 24 cases covering pattern + prefix +
allowlist invariants (allowlist entries must themselves be reserved).
- tests/test_config/test_schema.py::TestRouteDefOnError — 14 cases.
- tests/test_config/test_schema.py::TestAgentDefRaises — 10 cases.
No semantics change for existing workflows: both fields default to None
and the engine doesn't observe them yet (wiring lands in subsequent
commits).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…exceptions
Phase 1, step 2 of on_error routing.
Adds src/conductor/engine/errors.py:
- ErrorEnvelope TypedDict — the internal {kind, message, details}
shape. Strips the on-the-wire conductor_error: true discriminator so
callers don't see it in {{ failing_node.error.* }} templates.
- EnvelopeValidationError — distinct from ValidationError so the
engine can catch and translate malformed envelopes into synthetic
internal.* kinds rather than halting with a generic config error.
- coerce_envelope(raw) — validates on-the-wire input, normalizes
details to {} when absent.
- make_script_error(exit_code, stderr_tail, command) — synthesizes
the internal.script_error envelope.
- make_schema_violation(node_name, source, original_message,
failed_field?) — synthesizes the internal.schema_violation envelope
with rich details for the swallowed-by-catch-all diagnostics case.
- wrap_undeclared_kind(original, declared) — wraps an envelope whose
kind isn't in the node's declared raises list. Preserves the
original kind/message/details under details.original_* so an author
handling internal.undeclared_kind can still recover the intent.
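The coercion and wrapping behaviour described above can be sketched as follows. This is an illustrative reconstruction under assumptions: the exact validation rules, message wording, and the `declared` entry in `details` are not stated in the commit message.

```python
from typing import Any, TypedDict


class ErrorEnvelope(TypedDict):
    kind: str
    message: str
    details: dict[str, Any]


class EnvelopeValidationError(Exception):
    """On-the-wire input was not a well-formed envelope."""


def coerce_envelope(raw: Any) -> ErrorEnvelope:
    """Validate wire input and strip the conductor_error discriminator."""
    if not isinstance(raw, dict) or raw.get("conductor_error") is not True:
        raise EnvelopeValidationError("missing conductor_error: true")
    kind, message = raw.get("kind"), raw.get("message")
    if not isinstance(kind, str) or not isinstance(message, str):
        raise EnvelopeValidationError("kind and message must be strings")
    details = raw.get("details")
    if details is None:
        details = {}  # normalize absent details to an empty dict
    if not isinstance(details, dict):
        raise EnvelopeValidationError("details must be an object")
    return {"kind": kind, "message": message, "details": details}


def wrap_undeclared_kind(original: ErrorEnvelope, declared: list[str]) -> ErrorEnvelope:
    """Wrap an envelope whose kind is not in the node's raises list."""
    return {
        "kind": "internal.undeclared_kind",
        "message": f"kind {original['kind']!r} is not declared in raises",
        "details": {
            "original_kind": original["kind"],
            "original_message": original["message"],
            "original_details": original["details"],
            "declared": list(declared),  # assumption: not named in the source
        },
    }
```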
Adds two exceptions to src/conductor/exceptions.py:
- UnhandledNodeError — internal signal raised by the router when an
error envelope reaches no on_error route at the current level. The
engine catches this at the per-node dispatch site and re-raises as
UnhandledWorkflowError. Not intended to surface to end users.
- UnhandledWorkflowError — workflow halted on a typed error envelope.
Carries the envelope and a frame trail (single frame in Phase 1;
Phase 2 will accumulate frames across sub-workflow boundaries).
CLI maps this to a distinct exit code so callers can distinguish
'workflow ran and halted on typed error' from generic failures.
Tests: tests/test_engine/test_errors.py — 18 cases covering envelope
coercion (including discriminator stripping and details normalization),
the three synthetic-kind constructors, and the exception classes
including the empty-frames defensive path.
Nothing yet emits these envelopes or exceptions; the next commits wire
them through the router, executors, and engine dispatch.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Route evaluation now partitions by RouteDef.on_error:
- on_error is None -> success bucket, evaluated when error=None
- on_error is set -> error bucket, evaluated when an envelope is
passed; the route's on_error matcher (True | str | list[str])
must match envelope[kind]
Behavior preserved on the success path: first matching when: wins, no
catch-all raises the existing ValueError. New: error-path exhaustion
raises UnhandledNodeError carrying the envelope so the engine can
translate it into UnhandledWorkflowError at the call site.
Error-route eval context exposes the envelope as `error` for both
Jinja2 ({{ error.kind }}) and simpleeval (kind == 'x.y' via flatten).
Adds 12 tests in TestRouterErrorBucket covering bucket isolation,
all three on_error matcher shapes, when: combined with on_error,
output: transforms, ordering within the bucket, the new
UnhandledNodeError path, and the legacy ValueError path.
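A simplified sketch of the two-bucket evaluation described in this commit. Routes are modelled as plain dicts with hypothetical `to` and `on_error` keys, and `when:` evaluation is left out entirely, so this only illustrates the partitioning and the two exhaustion paths.

```python
class UnhandledNodeError(Exception):
    """No on_error route matched the envelope at this level."""

    def __init__(self, envelope):
        super().__init__(f"unhandled error kind {envelope['kind']!r}")
        self.envelope = envelope


def on_error_matches(matcher, kind):
    """Matcher is True (catch-all), a single kind, or a list of kinds."""
    if matcher is True:
        return True
    if isinstance(matcher, str):
        return matcher == kind
    return kind in matcher


def select_route(routes, error=None):
    """Return the first route from the bucket selected by `error`."""
    if error is None:
        # Success bucket: only routes without on_error are considered.
        for route in routes:
            if route.get("on_error") is None:
                return route
        raise ValueError("No matching route found")
    # Error bucket: only routes with on_error set are considered.
    for route in routes:
        matcher = route.get("on_error")
        if matcher is not None and on_error_matches(matcher, error["kind"]):
            return route
    raise UnhandledNodeError(error)
```

Ordering within a bucket decides the winner, so a catch-all `on_error: true` route placed first would shadow more specific routes after it.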
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds WorkflowContext.store_error(agent_name, envelope) that co-locates
error envelopes with their producing node's slot. The rendered context
shape for a node is now `{node: {output?, error?}}`.
All three context modes surface errors:
- accumulate: each failing node gets `{agent: {error: envelope}}`
- last_only: failing last agent surfaces with `{error: ...}`
- explicit: declarations of the form `agent.error[.field]` copy
the whole envelope into ctx[agent]['error']. Envelopes are bounded
and templates commonly need `error.details.*`, so the runtime
never field-slices them.
Validator updates so existing semantic checks cover the new path:
- INPUT_REF_PATTERN gains an `error_agent`/`error_field` branch
matching `<agent>.error[.field]`.
- _OUTPUT_ATTRS includes the singular `error` so Jinja AST analysis
treats `{{ failing.error.kind }}` as a real output-class ref.
- TemplateRefs gains `agent_error_refs: set[str]` and the AST
walker populates it.
- The per-agent template walk emits the same explicit-mode
`undeclared input` warning for `.error` refs that `.output`
and group `.errors` already get.
- Unknown-agent checks cover the `.error` ref path.
- Parallel-group internal-dependency check rejects intra-group
`.error` refs too.
Checkpoint round-trip via to_dict/from_dict serializes `agent_errors`;
older checkpoints without the key restore as empty (backwards-compat).
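To illustrate the slot shape and the backwards-compatible checkpoint restore, here is a toy context class. `MiniContext`, `agent_outputs`, and `render_accumulate` are hypothetical names; only `store_error`, `agent_errors`, `to_dict`, and `from_dict` come from the commit message, and the real `WorkflowContext` is certainly richer than this.

```python
class MiniContext:
    """Toy stand-in for WorkflowContext (accumulate mode only)."""

    def __init__(self):
        self.agent_outputs = {}  # hypothetical field name
        self.agent_errors = {}

    def store_error(self, agent_name, envelope):
        # Co-locate the envelope with the producing node's slot.
        self.agent_errors[agent_name] = envelope

    def render_accumulate(self):
        # Rendered shape per node: {node: {output?, error?}}
        ctx = {}
        for name, output in self.agent_outputs.items():
            ctx.setdefault(name, {})["output"] = output
        for name, envelope in self.agent_errors.items():
            ctx.setdefault(name, {})["error"] = envelope
        return ctx

    def to_dict(self):
        return {
            "agent_outputs": dict(self.agent_outputs),
            "agent_errors": dict(self.agent_errors),
        }

    @classmethod
    def from_dict(cls, data):
        ctx = cls()
        ctx.agent_outputs = dict(data.get("agent_outputs", {}))
        # Older checkpoints have no agent_errors key; restore as empty.
        ctx.agent_errors = dict(data.get("agent_errors", {}))
        return ctx
```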
Adds 14 tests in TestWorkflowContextStoreError, 5 INPUT_REF_PATTERN
shape tests, and 3 TemplateRefs error-extraction tests. Fixes the
test_empty_context dict-equality fixture to include the new field.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…type
Adds two new helpers in conductor.config.validator:
- _validate_on_error_routes(agent): hard-errors on_error routes on node types that don't raise envelopes in Phase 1 (human_gate, workflow); validates each kind matches KIND_PATTERN; if agent.raises is declared, every concrete kind in on_error must be in raises or RESERVED_ON_ERROR_ALLOWLIST (catch-all True always legal).
- _validate_group_routes_no_on_error(): rejects on_error routes on parallel and for_each groups (group-level envelopes are Phase 2).
Both helpers are wired into validate_workflow_config().
11 new tests cover plain agent + script (legal), human_gate + workflow + parallel + for_each (rejected), bad kind format, undeclared-kind cross-check, catch-all, reserved allowlist, and no-raises = no constraint.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
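The cross-check rules in this commit can be sketched roughly as below. `check_on_error_routes` and its dict-shaped routes are hypothetical, the KIND_PATTERN format check is omitted, and the error wording is invented for illustration.

```python
RESERVED_ON_ERROR_ALLOWLIST = frozenset(
    {"internal.script_error", "internal.schema_violation", "internal.undeclared_kind"}
)
# Node types that do not raise envelopes in Phase 1.
NODE_TYPES_WITHOUT_ENVELOPES = frozenset({"human_gate", "workflow"})


def check_on_error_routes(name, node_type, routes, raises):
    """Return a list of validation error strings (empty when valid)."""
    errors = []
    for route in routes:
        matcher = route.get("on_error")
        if matcher is None:
            continue  # success route, nothing to check
        if node_type in NODE_TYPES_WITHOUT_ENVELOPES:
            errors.append(f"{name}: on_error is not supported on type {node_type}")
            continue
        if matcher is True or raises is None:
            continue  # catch-all is always legal; no raises means no constraint
        kinds = [matcher] if isinstance(matcher, str) else matcher
        for kind in kinds:
            if kind not in raises and kind not in RESERVED_ON_ERROR_ALLOWLIST:
                errors.append(f"{name}: on_error kind {kind!r} is not in raises")
    return errors
```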
…ions
AgentOutput grows an optional `error: dict | None` field that carries an ErrorEnvelope when the agent failed.
AgentExecutor.execute() now, after the provider call returns:
1. If the response is a dict with `conductor_error: true`, coerces the envelope (or synthesizes `internal.schema_violation` when the envelope itself is malformed) and attaches it to output.error WITHOUT running validate_output (the declared output schema doesn't apply to error envelopes).
2. Otherwise runs validate_output, and on ValidationError synthesizes an `internal.schema_violation` envelope on output.error instead of raising.
Partial outputs (from mid-agent interrupts) bypass both checks.
The error module is imported lazily inside execute() to avoid a circular import via conductor.engine.__init__.
Updated two existing tests to assert the new envelope contract instead of expecting ValidationError. Added three new tests covering well-formed envelopes, malformed envelopes, and the partial-output bypass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nthesize internal.script_error
ScriptOutput grows an optional `error: dict | None` field. Each script run now allocates a tempfile via tempfile.mkstemp(), closes the fd immediately, and exposes the path in the env as CONDUCTOR_ERROR_OUT, set AFTER the agent.env merge so users cannot accidentally redirect or override it.
After process.communicate() the executor reads the file:
- empty / missing → no envelope
- valid JSON envelope → coerce_envelope, attach to output.error
- valid JSON but malformed envelope → internal.schema_violation
- invalid JSON → internal.schema_violation
If no envelope was written AND exit_code != 0 AND the node has opted into error routing (raises declared OR any on_error route present), the executor synthesizes an `internal.script_error` envelope. Legacy workflows that route on exit_code (no opt-in) keep their existing behavior.
Temp file is always removed in finally, even on timeout/command-not-found.
Added 7 tests covering: no envelope on success, well-formed envelope surfaces, user env cannot override, synthesized internal.script_error on opt-in, legacy non-zero with no opt-in keeps error=None, malformed envelope downgrades to schema_violation, temp file cleanup.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
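The script-side flow above can be approximated with a synchronous sketch. `run_script` and its `opted_in` flag are hypothetical simplifications (the real executor is described as using `process.communicate()` and a richer `internal.script_error` payload), and the envelope messages are invented.

```python
import contextlib
import json
import os
import subprocess
import sys
import tempfile


def run_script(command, user_env=None, opted_in=False):
    """Run `command`; return (exit_code, envelope_or_None)."""
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)  # the script reopens the path itself
    env = {**os.environ, **(user_env or {})}
    env["CONDUCTOR_ERROR_OUT"] = path  # set last so user env cannot override it
    try:
        proc = subprocess.run(command, env=env, capture_output=True, text=True)
        try:
            with open(path, encoding="utf-8") as f:
                raw = f.read()
        except FileNotFoundError:
            raw = ""  # missing file is treated like an empty one
    finally:
        with contextlib.suppress(FileNotFoundError):
            os.remove(path)  # always clean up the temp file

    code = proc.returncode
    if raw.strip():
        try:
            data = json.loads(raw)  # JSONDecodeError subclasses ValueError
            well_formed = (
                isinstance(data, dict)
                and data.get("conductor_error") is True
                and isinstance(data.get("kind"), str)
                and isinstance(data.get("message"), str)
            )
            if not well_formed:
                raise ValueError("malformed error envelope")
        except ValueError as exc:
            return code, {"kind": "internal.schema_violation",
                          "message": str(exc), "details": {}}
        return code, {"kind": data["kind"], "message": data["message"],
                      "details": data.get("details") or {}}
    if code != 0 and opted_in:
        return code, {"kind": "internal.script_error",
                      "message": f"script exited with code {code}",
                      "details": {"exit_code": code}}
    return code, None  # success, or legacy exit_code routing
```

A caller would pass something like `[sys.executable, "probe.py"]` as `command`; the `opted_in` flag stands in for "raises declared or any on_error route present".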
…ary, group failure carry envelope
Wires the on_error contract through the workflow engine end-to-end:
- _evaluate_routes() now accepts an optional error envelope; empty routes plus an unhandled envelope raises UnhandledNodeError instead of silently routing to `…`.
- New _normalize_envelope_for_node() applies undeclared-kind wrapping (raises declared + kind not in raises + not in allowlist → internal.undeclared_kind with original kind preserved under details.original_kind).
- New _handle_leaf_error() centralizes the leaf path: normalize, store_error, evaluate error routes, raise UnhandledWorkflowError with a single-leaf frame trail on no match.
- Agent call site (~2583) and script call site (~2359) both branch on output.error BEFORE storage and BEFORE schema validation. Script's success path runs full output-schema validation as before; error path skips it.
- Sub-workflow call sites catch UnhandledWorkflowError from child engines and re-raise as ExecutionError. Phase 1 invariant: envelopes do not propagate across the sub-workflow boundary.
- ParallelAgentError and ForEachError grow an optional envelope field. The parallel/for_each child execution helpers detect output.error, normalize it, raise an ExecutionError tagged with ._envelope, and the existing failure_mode machinery records it. Downstream group consumers can inspect the typed envelope.
- Tests: 6 new tests in tests/test_engine/test_error_routing.py covering agent envelope routing, unhandled-envelope halt, undeclared-kind normalization (with the rescue agent reading the original kind from context), success-path regression, script envelope routing, and legacy exit_code routing regression.
Full suite (2887 tests) green; no regressions.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…on UnhandledWorkflowError
Phase 1 Step 9. When the engine raises UnhandledWorkflowError (a leaf node returned an error envelope that no on_error route matched), the new `except UnhandledWorkflowError` arm in `_execute_loop`:
* writes a single-line `errors.jsonl` record under `…/conductor/` using the same naming convention as the event log (`conductor-<workflow>-<ts>-<run_id>.errors.jsonl`), carrying the envelope, frame trail, and leaf node name;
* emits a typed `workflow_failed` event with `error_type='UnhandledWorkflowError'` plus the envelope, frames and errors-jsonl path so dashboard/log subscribers can render the typed halt distinctly from a generic ConductorError;
* runs the `on_error` lifecycle hook and saves a checkpoint, for parity with the other failure paths;
* re-raises so the CLI (Step 10) can map it to its distinct exit code.
The new arm is placed *before* the existing `except ConductorError` so the typed halt is caught first. Generic ConductorError handling is unchanged.
Adds two integration tests covering the jsonl artefact and the typed event. Also tightens a docstring that was 101 chars (pre-existing from Step 8, caught now by ruff format check).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
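A rough sketch of the single-line `errors.jsonl` write described in this step. The function name, its parameters, and the record's key names are assumptions; only the file-name convention and the "envelope, frame trail, leaf node name" payload come from the commit message.

```python
import json
from pathlib import Path


def write_errors_jsonl(directory, workflow, ts, run_id, envelope, frames, leaf_node):
    """Append one JSON record; return the path, or None if the write failed."""
    path = Path(directory) / f"conductor-{workflow}-{ts}-{run_id}.errors.jsonl"
    record = {  # hypothetical key names
        "envelope": envelope,
        "frames": frames,
        "leaf_node": leaf_node,
    }
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a", encoding="utf-8") as f:
            # default=str keeps non-JSON values (paths, datetimes) serializable
            f.write(json.dumps(record, default=str))
            f.write("\n")
    except OSError:
        return None
    return path
```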
…stderr summary
Phase 1 Step 10. Both `run` and `resume` now catch `UnhandledWorkflowError` specifically before the generic `except Exception`, render a typed panel to stderr via the new `print_unhandled_workflow_error` helper, and exit with code 3 so callers (CI, polyphony, shell scripts) can distinguish 'workflow ran to completion and halted on a typed error' from a generic failure (code 1).
The summary panel surfaces the leaf node, kind, message, optional `details`, and the path to the `errors.jsonl` artefact. To make the path reachable from the CLI without holding a reference to the engine, `_execute_loop` now also attaches the path to the exception instance (`exc.errors_jsonl_path`) before re-raising.
Adds two CLI tests:
* unhandled typed halt exits 3,
* generic `RuntimeError` still exits 1 (regression guard).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
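The exit-code mapping reduces to ordering the `except` arms, as in this sketch. `run_cli` and the plain-text summary are hypothetical; the real CLI renders a panel through `print_unhandled_workflow_error`, and the exception class here is a stand-in.

```python
import sys


class UnhandledWorkflowError(Exception):
    """Stand-in for the real exception; carries the typed envelope."""

    def __init__(self, envelope, errors_jsonl_path=None):
        super().__init__(envelope["message"])
        self.envelope = envelope
        self.errors_jsonl_path = errors_jsonl_path


def run_cli(run_workflow, stderr=None):
    """Call run_workflow() and translate failures into exit codes."""
    stderr = stderr if stderr is not None else sys.stderr
    try:
        run_workflow()
    except UnhandledWorkflowError as exc:  # must precede the generic arm
        env = exc.envelope
        print(f"typed halt: {env['kind']}: {env['message']}", file=stderr)
        print(f"halt log: {exc.errors_jsonl_path}", file=stderr)
        return 3
    except Exception as exc:
        print(f"error: {exc}", file=stderr)
        return 1
    return 0
```

A shell caller can then branch on `$?` being 3 versus 1 to tell a typed halt from a generic failure.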
…n/node/dotnet
Phase 1 Step 11. New `src/conductor/helpers/error/` directory with
one-file convenience modules for raising typed Conductor error
envelopes from script-type nodes. All helpers implement the same
contract: read `CONDUCTOR_ERROR_OUT`, write the
`{conductor_error: true, kind, message, details?}` JSON envelope to
that path, and return, leaving exit-code management to the caller.
* `Conductor.Error.psm1`: PowerShell, `Write-ConductorError` cmdlet
* `conductor-error.sh`: Bash/sh, sourced `conductor_error` function
* `conductor_error.py`: Python, `raise_kind` function
* `conductor-error.mjs`: Node, exported `raiseError`
* `ConductorError.cs`: .NET, static `ConductorError.Raise`
* `README.md`: quick reference + usage examples per engine
Helpers ship under `src/conductor/` so hatchling rolls them into
the wheel; verified by inspecting the built artefact. Nothing is
auto-loaded, on PATH, or on PYTHONPATH; script authors must
explicitly Import-Module / source / import to use them, and authors
who don't want them write the JSON themselves (it's three lines per
engine).
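For orientation, the helper contract amounts to something like the following. The signature of `raise_kind` is an assumption (the commit message only names the function), as is the choice to fail with `KeyError` when the variable is unset.

```python
import json
import os


def raise_kind(kind, message, details=None):
    """Write a typed error envelope to the path in CONDUCTOR_ERROR_OUT.

    Returns normally; the caller decides the exit code.
    """
    path = os.environ["CONDUCTOR_ERROR_OUT"]  # KeyError if not under the engine
    envelope = {"conductor_error": True, "kind": kind, "message": message}
    if details is not None:
        envelope["details"] = details
    with open(path, "w", encoding="utf-8") as f:
        json.dump(envelope, f)
```

A script node would call, for example, `raise_kind("external.git.drift", "working tree differs from origin")` and then exit however it sees fit.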
Adds `tests/test_helpers/test_error_helpers.py` with 6 cases
covering the Python helper's envelope shape, env-var contract,
no-sys.exit guarantee, and round-trip through `coerce_envelope`. The
non-Python helpers are exercised by the cross-engine integration test
landing in Step 13.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase 1 Step 12. Two new example workflows under `examples/` plus
an Error Routing section in `examples/README.md`.
`error-routing.yaml`:
* A `type: script` probe that writes `{conductor_error: true,
  kind, message, details}` to `$CONDUCTOR_ERROR_OUT` and exits 0
  via the language-neutral contract (no helper required).
* The probe declares `raises: [external.git.drift,
  external.api.rate_limited]` so undeclared kinds get normalised to
  `internal.undeclared_kind`.
* Routes select by `on_error: <kind>` to demonstrate that an
  envelope picks the typed arm over a generic exit_code fallback.
* Handlers read `{{ probe.error.kind }}`,
  `{{ probe.error.message }}`, and `{{ probe.error.details.* }}`.
* A `simulated_failure` workflow input toggles between
  `ok` / `drift` / `rate_limited` so the same YAML exercises
  all three arms.
`error-routing-helpers.yaml`:
* Same flow, but raises via the shipped Python helper
  (`conductor.helpers.error.conductor_error.raise_kind`) instead of
  hand-rolled JSON, so authors see the ergonomic version.
Both examples validate with `conductor validate` and the full
`make validate-examples` sweep (17/17 pass). Caught one writing
issue along the way: the input schema field is `input:` (singular),
not `inputs:`; fixed before committing.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase 1 Step 13 (acceptance #1). New `tests/test_integration/test_error_routing_cross_engine.py` exercises the `CONDUCTOR_ERROR_OUT` contract end-to-end through the real `WorkflowEngine` (no executor mocking) with three writer scripts in different languages: Python, PowerShell (`pwsh`), and bash.
All three share the same workflow YAML shape and the same expected outcome: the probe writes a typed envelope, the engine routes by `on_error: external.git.drift`, and a rescue agent reads the kind back from the routed-to scope.
Each test is skipped when the corresponding interpreter is missing from PATH (and bash is also skipped on Windows, where `shutil.which` typically returns the WSL relay shim that fails with an opaque `execvpe(/bin/bash) failed: No such file or directory`, which is outside the scope of this contract test; the brief calls for bash-on-Linux specifically).
Locally on Windows: 2 pass (python + pwsh), 1 skipped (bash). On a Linux CI runner with pwsh installed, all three execute. The contract is the same string of bytes in every engine, so identical assertions hold across them, which is the whole point.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ty's TypedDict to dict() conversion picks the wrong overload for our ErrorEnvelope, surfacing 8 spurious diagnostics. Use cast() at the leaf sites where TypedDict envelopes cross into APIs typed as dict[str, Any] (script/agent executor returns; workflow.py store_error and router.evaluate call sites). Brings ty count back to the 12-diagnostic baseline (all pre-existing Windows termios/tty noise). Lint, format, examples-validation, and full test suite (2943 pass, 12 baseline failures) all green. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #229 +/- ##
=======================================
Coverage ? 88.20%
=======================================
Files ? 65
Lines ? 10115
Branches ? 0
=======================================
Hits ? 8922
Misses ? 1193
Partials ? 0
Attempted to rebase Conflict summary: v0.1.18's set-step PR initializes output context as Built the dogfood from |
Tried rebasing this onto v0.1.18 base for polyphony dogfood integration, and hit a design conflict in The clash: PR #229 commit The Interim: Polyphony dogfood is staying pinned at v0.1.17 base ( |
jrob5756 left a comment
Took a pass on the overall approach. I like the direction: success/error route buckets fit conductor's existing model cleanly, the opt-in synthesis keeps legacy exit_code routing untouched, and the phasing discipline (deferring subworkflow/group propagation as hard validation errors rather than silent no-ops) is the right call. Two pieces of feedback before this locks in the authoring contract:
1. The script transport is good — let's just document the asymmetry.
To be clear for others reading: CONDUCTOR_ERROR_OUT is a file path, not JSON-in-an-env-var, so escaping/size concerns don't apply — this is the $GITHUB_OUTPUT pattern and I think it's the right choice. The reason it's a file (and not a stdout discriminator like agents use) is that type: script already parses stdout as the output: contract, so errors need an out-of-band channel. That justification is sound, but it means the raise mechanism differs by node type (agents: in-band stdout conductor_error: true; scripts: out-of-band file). Please call that split out explicitly in docs/workflow-syntax.md so authors aren't surprised.
The parse-failure handling is solid — malformed JSON and malformed envelopes both downgrade to internal.schema_violation rather than dropping the signal, and reading only after communicate() returns means there's no write/read race on the non-atomic write. 👍 One thing to confirm in the correctness pass: malformed→schema_violation fires regardless of opt-in, whereas synthesized internal.script_error only fires when the node opted in. I think that's intended, but let's make sure a node that writes garbage with no on_error route halting on exit 3 (vs. legacy exit-code routing) is the behavior we want.
2. Trim the bundled multi-language helpers.
The contract itself is great precisely because it's language-neutral — "write a 3-field JSON file" — and you've kept it fully optional (the raw example uses no helper, plain echo > "$CONDUCTOR_ERROR_OUT" works). That part I'm fully on board with.
What I'd push back on is shipping conductor-error.sh, conductor-error.mjs, Conductor.Error.psm1, and ConductorError.cs inside the Python wheel. A pwsh/.NET/node user can't naturally consume a file buried in site-packages/conductor/helpers/error/ — there's no Import-Module/NuGet/npm story and the path is venv- and version-specific, so in practice these are copy-paste snippets dressed up as libraries. It's also five implementations of a trivial contract to keep in sync, and it undercuts the "it's just a JSON file" message.
Proposal: keep the Python helper (conductor is Python; inline python -c and script nodes make a real importable module genuinely useful) and demote the other four to documented copy-paste snippets in docs/ rather than shipping source in the wheel. Same ergonomic win where it's natural, without the maintenance/consumption awkwardness.
Neither of these is a structural objection — the file side-channel and the fail-safe parse handling are the right foundation. Mostly packaging + docs. Will follow up with correctness notes on the router/engine wiring.
    try:
        with open(path, encoding="utf-8") as f:
            raw = f.read()
    except OSError:
Silent envelope drop on read failure when exit_code == 0.
Swallowing OSError → None here is fine when the script never wrote a typed error (the file legitimately doesn't exist / is empty), but it conflates that benign case with a real read failure on a path the engine itself created. Consider this path:
- Script writes a valid envelope to `$CONDUCTOR_ERROR_OUT` (raising a typed error) and exits 0, relying on the file as the failure channel.
- The read here raises `OSError` (permissions, transient FS error, racy temp cleanup) → `_read_error_envelope` returns `None`.
- Back at `execute()` (~line 191), synthesis is gated on `exit_code != 0 and _node_uses_error_routing(agent)`, so nothing fires.
- `ScriptOutput.error` stays `None`, the engine sees a success, and the node is routed down the success path: no envelope, no log, no event.
This is the exact silent-failure class typed error routing exists to prevent. An unexpected OSError on an engine-owned path should not be conflated with "script chose not to raise" — at minimum emit a diagnostic, and ideally synthesize a read-failure variant of internal.script_error / internal.schema_violation so the typed-failure signal is preserved regardless of exit code.
Side note: open(..., encoding="utf-8") defers decoding to f.read(), so a UnicodeDecodeError is not caught by except OSError here — that crashes rather than silently dropping, which is arguably the lesser evil but inconsistent.
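One possible shape for the fix suggested here, as a sketch only: separate "file absent" from "read failed" and turn the latter into a synthetic envelope. `read_error_file` and `read_failure_envelope` are hypothetical names, and using `internal.schema_violation` for the read-failure variant is just one of the two options the comment mentions.

```python
def read_error_file(path):
    """Return (raw_text, failure_reason); exactly one side carries a signal."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read(), None
    except FileNotFoundError:
        return "", None  # benign: nothing was written
    except (OSError, UnicodeDecodeError) as exc:
        # UnicodeDecodeError is a ValueError, not an OSError, so it has to
        # be listed explicitly to be handled consistently.
        return None, f"{type(exc).__name__}: {exc}"


def read_failure_envelope(path, reason):
    """Synthesize a typed envelope so the failure signal is not dropped."""
    return {
        "kind": "internal.schema_violation",
        "message": "could not read the error file written by the script",
        "details": {"path": str(path), "reason": reason},
    }
```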
    _LEAF_TYPES_THAT_RAISE: frozenset[str | None] = frozenset({"agent", "script", None})


    def _validate_on_error_routes(agent: Any) -> list[str]:
Validator gap + two related correctness items it would be natural to address together here.
a) Validator gap — node with only on_error routes crashes generically on the happy path. This validator accepts a node with only error routes and no success fallback. On a successful run, Router._evaluate_success (router.py:103-121) skips every route whose on_error is not None, then raises a plain ValueError("No matching route found..."). That bubbles up as a generic exception → CLI exit 1 → no errors.jsonl, no typed halt panel, despite the workflow being clearly authored for the typed-error world. Verified locally. Recommend rejecting at validation here with a teaching message like "Agent 'X' has on_error routes but no success route. Add a route without on_error to handle the happy path."
b) Related — _evaluate_error silently ignores output_for_route (engine/router.py:138-176, called from _handle_leaf_error at engine/workflow.py:4452). The plumbing accepts output_for_route but _evaluate_error only flattens error into the eval context and drops current_output. Any error-route when: clause referencing the failing node's output (e.g. attempt_count > 3) will silently see the previous agent's output. Either merge output_for_route into the error eval context (more useful) or drop the plumbing (more honest), but the current silent inconsistency is a latent footgun.
c) Related — agent schema-violation auto-upgrades existing workflows from exit 1 → exit 3 (executor/agent.py:282-302). Pre-PR: schema violation → ValidationError → exit 1. Post-PR: synthesizes internal.schema_violation envelope → unhandled → exit 3 + errors.jsonl. This is asymmetric with the script executor's _node_uses_error_routing opt-in (script.py:53-67), which preserves legacy behavior for unopted scripts. Either gate the agent upgrade on the same opt-in (so the asymmetry is principled, not accidental), or document the exit-code change in release notes.
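The validation-time rejection proposed in (a) could look roughly like this. `check_has_success_route` and the dict-shaped routes are hypothetical; the message text is the one suggested in the comment.

```python
def check_has_success_route(agent_name, routes):
    """Return validation errors for a node with only on_error routes."""
    has_error_route = any(r.get("on_error") is not None for r in routes)
    has_success_route = any(r.get("on_error") is None for r in routes)
    if has_error_route and not has_success_route:
        return [
            f"Agent '{agent_name}' has on_error routes but no success route. "
            "Add a route without on_error to handle the happy path."
        ]
    return []
```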
    # path treats it like any other sub-workflow failure. Phase 2
    # will introduce envelope propagation with parent frames.
    raise ExecutionError(
        f"sub-workflow '{agent.name}' halted on unhandled error envelope "
Missing integration tests guard documented Phase-1 invariants.
Four integration boundaries have documented contracts but no tests; a regression on any of them would silently change semantics in ways unit tests won't catch.
- Sub-workflow downgrade (this site and the mirror at line 1294). The module docstring at `tests/test_engine/test_error_routing.py:15-16` claims "Phase 1 does NOT propagate envelopes across sub-workflow boundaries", but no test exercises either branch. Add a workflow with a `type: workflow` step whose child halts on an unhandled envelope; assert the parent sees `ExecutionError`, not `UnhandledWorkflowError`, and that the parent's own `on_error` routes do NOT fire.
- Parallel / for_each envelope plumbing. `ParallelAgentError.envelope` / `ForEachError.envelope` are populated in production at `workflow.py:3853, 3909, 4310, 4343` from the `_envelope` attribute attached at `:3756` and `:4213`. The only tests of these dataclasses (`tests/test_engine/test_parallel.py:504-536`) pre-date the PR and construct them in isolation. Add a test that runs a real parallel/for_each group with a typed-envelope child and asserts the envelope is preserved on the group's error.
- `step.error` input ref without producer's `raises`. The validator (`config/validator.py:1351`, walking output/outputs refs) doesn't cross-check a leaf-node `producer.error` reference against `producer.raises`. A consumer that declares `inputs: ["fetch.error"]` against a `fetch` node with no `raises:` passes validation and only `KeyError`s at runtime via `_add_agent_input` in `engine/context.py`.
- `resume` exit-code-3 path. `cli/app.py:930-933` mirrors the `run` exit-code handling for `UnhandledWorkflowError`, but only `run` is tested (`test_cli/test_run.py:988-1026`). Diverging exit codes between `run` and `resume` would break CI scripts that key on exit 3 for typed halts.
    self._errors_jsonl_path = errors_path
    # Attach the path to the exception itself so the CLI handler
    # can render it without needing a reference to the engine.
    e.errors_jsonl_path = errors_path  # type: ignore[attr-defined]
setattr side-channels on exceptions defeat the type system.
Two patterns here that are mechanically the same problem:
- `e.errors_jsonl_path = errors_path  # type: ignore[attr-defined]` on `UnhandledWorkflowError` (this site), read back via `getattr(error, "errors_jsonl_path", None)` in `cli/app.py:190`.
- `exc._envelope = normalized  # type: ignore[attr-defined]` on `ExecutionError` at `workflow.py:3758` and `:4214`, read back via `getattr(result, "_envelope", None)` at `workflow.py:3853, 3909, 4310, 4343`.
Both are undeclared attributes that the type checker can't see. A rename or typo in any of the 4 read sites silently degrades to "no envelope" / "no log path" with no test failure or type error.
Mechanical fix: add typed constructor params:
- `UnhandledWorkflowError(..., errors_jsonl_path: Path | None = None)`
- `ExecutionError(..., envelope: ErrorEnvelope | None = None)` (importable under `TYPE_CHECKING` to avoid the cycle, since `errors.py` is a leaf module)
This removes both # type: ignore[attr-defined] writes and all six defensive getattr(..., None) reads, and makes the carriers checkable end-to-end.
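A sketch of what the typed carriers might look like. The base classes, the other constructor parameters, and the use of a plain `dict[str, Any]` in place of `ErrorEnvelope` are assumptions made to keep the example self-contained; the real exceptions in `src/conductor/exceptions.py` may differ.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


class ExecutionError(Exception):
    def __init__(self, message: str, envelope: dict[str, Any] | None = None) -> None:
        super().__init__(message)
        # Declared attribute: readers use exc.envelope, no getattr fallback.
        self.envelope = envelope


class UnhandledWorkflowError(Exception):
    def __init__(
        self,
        envelope: dict[str, Any],
        frames: list[dict[str, Any]],
        errors_jsonl_path: Path | None = None,
    ) -> None:
        super().__init__(envelope.get("message", ""))
        self.envelope = envelope
        self.frames = frames
        # Settable after construction too, but now visible to the type checker.
        self.errors_jsonl_path = errors_jsonl_path
```

With the attributes declared, a rename at any read site becomes a type error instead of a silent `None`.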
Related (same theme but lower priority): the executor output fields ScriptOutput.error and AgentOutput.error are typed dict[str, Any] instead of ErrorEnvelope | None, which is what forces the cast("dict[str, Any]", envelope) calls in executor/script.py:242,244,251 and executor/agent.py:279,301. Tightening those two fields under TYPE_CHECKING would eliminate every cast introduced by this PR.
            f.write(json.dumps(record, default=str))
            f.write("\n")
    except OSError as e:
        logger.warning("Failed to write errors.jsonl at %s: %s", path, e)
logger.warning here is effectively invisible — route through the event system.
Per project convention, conductor never calls logging.basicConfig/addHandler; every module just does logging.getLogger(__name__). A logger.warning with no configured handler falls through to Python's lastResort handler, which writes to raw stderr at WARNING level and bypasses the Rich console, the event stream, and the --verbose flag entirely. In --web-bg it only lands in the captured .bg.stderr.log, not the dashboard or the .events.jsonl.
This is the worst possible site for that pattern: _write_errors_jsonl is the forensic artifact that pairs with .events.jsonl for post-mortem of an unhandled halt. When the write fails, the operator gets errors_jsonl_path = None rendered as Halt log: None in the CLI panel and no indication that writing actually failed (vs. simply not being attempted).
Fix: emit a WorkflowEvent (e.g. errors_jsonl_write_failed), or attach the failure reason to the existing workflow_failed event as a sibling field — either way it then renders through the same channel as the rest of the engine's diagnostics. The same concern applies to the other logger.warning sites in this file (:2026, :3334, :3376), but this one is on the error-routing critical path.
Phase 1 of the typed
on_errorrouting design from #227 (RFC).Companion PR to #227 (brainstorm spec). Spec excerpts inline below;
full design and open questions live in the RFC.
What ships in this PR (Phase 1 scope from the RFC)
- CONDUCTOR_ERROR_OUT points at a tempfile the script writes a typed envelope into; engine reads on exit. Works uniformly for pwsh, bash, python, node, dotnet (dotnet run --style).
- {conductor_error: true, kind, message, details} in agent output is treated as a raise rather than a regular response.
- RouteDef.on_error: bool | str | list[str], so routes can match on the failing node's typed error kind.
- {{ failing_node.error }} template scope: handler nodes get the envelope of the node that raised, alongside its (possibly partial) output.
- halts; engine writes errors.jsonl (TMPDIR pattern, alongside the event log) and emits a typed workflow_failed event. CLI maps the exception to new exit code 3.
- AgentDef.raises: list[str], where declared kinds are linted against route on_error declarations at validation time and checked at runtime; undeclared kinds are wrapped as internal.undeclared_kind.
- src/conductor/helpers/error/ for pwsh / bash / python / node / dotnet (ship in the wheel).
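The agent-side behavior listed above can be sketched as three small functions: detect an envelope in agent output, apply the raises check, and match a route's on_error value. All function names here are hypothetical; the on_error semantics (None is a success route, True is catch-all, str or list matches kinds) follow the schema commit message, while the message and details carried by the wrapped envelope are assumptions.

```python
from __future__ import annotations

from typing import Any


def envelope_from_agent_output(output: Any) -> dict[str, Any] | None:
    """Return an error envelope if the output is a raise, else None."""
    if isinstance(output, dict) and output.get("conductor_error") is True:
        return {
            "kind": output.get("kind"),
            "message": output.get("message", ""),
            "details": output.get("details") or {},
        }
    return None


def enforce_raises(envelope: dict[str, Any], raises: list[str] | None) -> dict[str, Any]:
    """Wrap kinds that the node did not declare in raises."""
    if raises is None or envelope["kind"] in raises:
        return envelope
    return {
        "kind": "internal.undeclared_kind",
        "message": envelope["message"],  # assumption: message is carried over
        "details": {**envelope["details"], "original_kind": envelope["kind"]},
    }


def route_matches(on_error: bool | str | list[str] | None, envelope: dict[str, Any] | None) -> bool:
    """Decide whether a route applies to a success (None) or an error envelope."""
    if on_error is None:
        return envelope is None  # success route
    if envelope is None:
        return False  # error routes never fire on success
    if on_error is True:
        return True  # catch-all error route
    kinds = [on_error] if isinstance(on_error, str) else on_error
    return envelope["kind"] in kinds
```

For example, an agent that returns {"conductor_error": True, "kind": "app.rate_limited", ...} from a node declaring raises: ["app.drift"] would, under these assumptions, be routed as internal.undeclared_kind with details.original_kind set to app.rate_limited.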
What's reserved for Phase 2/3 (out of scope here)
- retry / halt / propagate route actions.
- on_error on routes from type: workflow, human_gate, notification, and parallel/for_each groups is currently a hard validation error (avoids silent "handler that never fires" footguns). These will become valid in Phase 2.
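As a rough illustration of the hard validation error described above, the check below rejects on_error on routes whose source is one of the not-yet-supported kinds. The dict-based node shape, the function name, and treating parallel and for_each groups as type labels are all assumptions made for the sketch; the real validator works on the schema models.

```python
from __future__ import annotations

from typing import Any

# Assumed labels for sources whose on_error routes are rejected in Phase 1.
UNSUPPORTED_ON_ERROR_SOURCES = frozenset(
    {"workflow", "human_gate", "notification", "parallel", "for_each"}
)


def check_on_error_sources(nodes: list[dict[str, Any]]) -> list[str]:
    """Return one validation error per on_error route on an unsupported source."""
    errors: list[str] = []
    for node in nodes:
        if node.get("type") not in UNSUPPORTED_ON_ERROR_SOURCES:
            continue
        for index, route in enumerate(node.get("routes", [])):
            # on_error=None means a plain success route, which stays legal.
            if route.get("on_error") is not None:
                errors.append(
                    f"{node['name']}: routes[{index}].on_error is not supported "
                    f"on type '{node['type']}' in Phase 1"
                )
    return errors
```

Failing at load time rather than ignoring the field is what prevents the "handler that never fires" case: the author learns immediately that the route would not be honored.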
Reserved kinds emitted in Phase 1
- internal.script_error: script exited non-zero AND wrote no envelope (opt-in: only synthesized when the node has raises or any on_error route, so legacy exit_code-routing workflows are unaffected).
- internal.schema_violation: agent output failed its output: schema.
- internal.undeclared_kind: node with raises: raised a kind not in its list; original kind preserved under details.original_kind.
Reserved kind prefixes (validator forbids users declaring these): internal., provider., subworkflow., retry.
How to read this PR
The 14 commits map 1:1 to the implementation steps from the plan and are
designed to be reviewed in order. Each is independently testable and
green:
| Commit | Files |
| --- | --- |
| feat(schema) | config/schema.py |
| feat(engine) ErrorEnvelope types | engine/errors.py, exceptions.py |
| feat(router) success vs. error buckets | engine/router.py |
| feat(context) store_error + .error access | engine/context.py |
| feat(validator) cross-check | config/validator.py |
| feat(agent-exec) envelope path | executor/agent.py |
| feat(script-exec) CONDUCTOR_ERROR_OUT | executor/script.py |
| feat(engine-wire) leaf-error path | engine/workflow.py |
| feat(halt-jsonl) errors.jsonl + event | engine/workflow.py |
| feat(cli-exit) exit code 3 | cli/app.py |
| feat(helpers) 5 language helpers | helpers/error/* |
| feat(examples) | examples/error-routing*.yaml |
| test(xeng) cross-engine envelope contract | tests/test_integration/ |
| phase-1(checks) final ty/lint/test sweep | |
Examples
- examples/error-routing.yaml: script-based; uses the CONDUCTOR_ERROR_OUT contract directly with no helper. Workflow input simulated_failure toggles ok / drift / rate_limited.
- examples/error-routing-helpers.yaml: same shape using the shipped Python helper.
Both validate, both run on Windows and POSIX, both render {{ failing_node.error.kind }} / .message / .details in the handler's prompt.
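To make the helper-free contract concrete, here is a sketch of what a Python script node might do to raise a typed error, plus the matching read on the engine side. It assumes the envelope is a JSON object with kind, message, and details, and that the script then exits non-zero; the function names and the example kind app.drift are hypothetical, and the shipped helpers may differ.

```python
from __future__ import annotations

import json
import os
from typing import Any


def raise_typed_error(
    kind: str,
    message: str,
    details: dict[str, Any] | None = None,
    exit_code: int = 1,
) -> int:
    """Write an envelope to $CONDUCTOR_ERROR_OUT and return the exit code to use."""
    out_path = os.environ.get("CONDUCTOR_ERROR_OUT")
    if out_path:  # unset when the script runs outside the engine
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump({"kind": kind, "message": message, "details": details or {}}, f)
    return exit_code


def read_envelope(path: str) -> dict[str, Any] | None:
    """Engine-side read after the script exits; None means no envelope was written."""
    try:
        with open(path, encoding="utf-8") as f:
            text = f.read().strip()
    except OSError:
        return None
    return json.loads(text) if text else None
```

A script would typically finish with sys.exit(raise_typed_error("app.drift", "config drifted", {"field": "region"})). When read_envelope returns None after a non-zero exit, that is the case the PR describes as internal.script_error.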
Test posture
- non-serializable, registry/integration ×10) are pre-existing and unchanged.
- exit-code tests, 3 cross-engine integration tests (Python + pwsh run on Windows; bash skipped on Windows by design, since the WSL relay shim is unreliable in CI envs).
- ruff check), format clean (ruff format --check).
- ty check src back to the 12-diagnostic baseline (all pre-existing Windows termios/tty noise).
- make validate-examples equivalent green across all 17 bundled examples including the two new ones.
Open Phase 1 micro-decisions (resolved in this PR)
- halted on unhandled error).
- errors.jsonl path: same $TMPDIR/conductor/ convention as the event log; printed at end-of-run.
- errors.jsonl: single-element today, structured to accept multi-frame in Phase 2 without a shape change.
- {{ workflow.last_error }} not added (RFC says require {{ failing_node.error }}); can revisit in Phase 2.
cc @jasonrobertfox, companion to the RFC at #227.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>