
[RFC] retry: route action — Phase 2 of error-routing (companion to #229) #236

Draft
PolyphonyRequiem wants to merge 1 commit into microsoft:main from PolyphonyRequiem:proposal/retry-route-action

@PolyphonyRequiem

Member

Summary

Phase 2 RFC for the error-routing work in #229: adds a retry: route action that re-executes the same node on failure before escalating to a fallback error route.

This is a design document only — no implementation code. Filed as a DRAFT PR so it shows up in the conductor PR list alongside #229 and #227.

Design document: docs/proposals/retry-route-action.md


Motivation

PR #229 adds on_error: routing (catch and route elsewhere). It explicitly defers retry: to Phase 2. The gap: 14 of 19 polyphony AB#3257 human-gate nodes need retry-before-escalation semantics for idempotent infrastructure operations (git push, PR open, merge poll). Without retry:, transient network failures abort entire workflow runs; operators must re-trigger manually.

What this PR contains

  • docs/proposals/retry-route-action.md: full design covering schema, engine behavior, sub-workflow and for_each interaction, test plan, and open questions.

Sequencing

Phase 1: PR #229 merges (blocked on context.py sentinel conflict)
  ↓
Phase 2: This RFC approved
  ↓
Phase 2 implementation PR
  ↓
Polyphony YAML retrofit: 14 idempotent-retry gates removed (AB#3257)

Key design decisions (details in doc)

  • retry: only valid on error routes (on_error: required alongside)
  • to: optional when retry: is set (lint warning if both present)
  • max counts re-runs only (not first attempt): max: 3 = 4 total executions
  • Backoff: fixed or exponential, with initial_seconds and jitter (default ±25%)
  • Exhaustion: falls through to next matching on_error: route in document order
  • CONDUCTOR_RETRY_ATTEMPT env var + {{ conductor.retry_attempt }} template context
  • Sub-workflow retry: re-runs entire child workflow (Phase 1 validator restriction lifted)
  • for_each retry: per-iteration only (failed item retried, rest unaffected)
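The counting and backoff rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not the engine's implementation: the function names `retry_delay` and `run_with_retry`, the doubling factor for the exponential strategy, and the convention that the first execution sees attempt index 0 are all assumptions that the RFC summary does not specify.

```python
import random
import time


def retry_delay(attempt, initial_seconds, strategy="exponential", jitter=0.25, rng=random):
    """Delay in seconds before re-run number `attempt` (1-based).

    Assumption: "exponential" doubles the delay on each re-run; the RFC
    summary names the strategy but not the growth factor.
    """
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if strategy == "fixed":
        base = initial_seconds
    elif strategy == "exponential":
        base = initial_seconds * 2 ** (attempt - 1)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Jitter of 0.25 spreads the delay uniformly over +/-25% of the base.
    return base * (1 + rng.uniform(-jitter, jitter))


def run_with_retry(node, max_retries, initial_seconds,
                   strategy="exponential", jitter=0.25, sleep=time.sleep):
    """Run `node`, re-running it up to `max_retries` times on failure.

    `max_retries` counts re-runs only, so max_retries=3 allows 4 executions.
    `node` receives the attempt index (assumed 0 for the first execution).
    """
    attempt = 0
    while True:
        try:
            return node(attempt)
        except Exception:
            if attempt >= max_retries:
                # Exhausted: re-raise so the caller can fall through to the
                # next matching error route.
                raise
            attempt += 1
            sleep(retry_delay(attempt, initial_seconds, strategy, jitter))
```

With `max_retries=3` a node that always fails is executed four times, and with `jitter=0` and `initial_seconds=1` the waits between executions are 1, 2, and 4 seconds under the assumed doubling rule.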

cc @jasonrobertfox — companion to #227 (RFC) and #229 (Phase 1).

Filed by Mahler (Conductor Expert) on the polyphony squad.

@codecov-commenter


Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@e6f8bd7). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #236   +/-   ##
=======================================
  Coverage        ?   88.39%           
=======================================
  Files           ?       63           
  Lines           ?    10553           
  Branches        ?        0           
=======================================
  Hits            ?     9328           
  Misses          ?     1225           
  Partials        ?        0           
