fix: report suspends as PENDING to plugins#492

Open
yaythomas wants to merge 2 commits into main from fix/suspend-pending-outcome

@yaythomas

Contributor

Closes #491

Problem

In the experimental OpenTelemetry plugin, a durable execution that suspends inside a child context (for example a child context that calls context.wait(...)) produced a false fault on the child-context span in X-Ray, with a recorded exception named TimedSuspendExecution. The execution behaved correctly. A suspend is normal durable control flow, not a failure.

Root cause

The producer in state.py wrap_user_function reported a suspend to instrumentation plugins using the concrete exception name:

ErrorObject(type=type(e).__name__, ...)  # "TimedSuspendExecution"

UserFunctionOutcome.from_error matched only the base class name (SuspendExecution), so the suspend fell through to FAILED instead of PENDING. Since TimedSuspendExecution is the type every timed wait raises, the PENDING branch was never reached for that case. The OTEL plugin then rendered FAILED as a span error, which X-Ray shows as a fault. Top-level waits are unaffected because a top-level suspend resolves to InvocationStatus.PENDING at the invocation level.
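The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the SDK's code: the classes are simplified stand-ins that only borrow the names used in this description, and `report` is a hypothetical helper standing in for the error-reporting step in wrap_user_function.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class SuspendExecution(Exception):
    """Stand-in for the SDK's base suspend signal."""


class TimedSuspendExecution(SuspendExecution):
    """Stand-in for the subclass that timed waits raise."""


@dataclass
class ErrorObject:
    # Simplified stand-in: only the fields needed for the illustration.
    type: str
    message: str = ""


class UserFunctionOutcome(Enum):
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    PENDING = "PENDING"

    @classmethod
    def from_error(cls, error: ErrorObject | None) -> UserFunctionOutcome:
        # Pre-fix behaviour as described above: compare against the
        # base class *name* only, so subclasses never match.
        if error is None:
            return cls.SUCCEEDED
        if error.type == "SuspendExecution":
            return cls.PENDING
        return cls.FAILED


def report(exc: Exception) -> UserFunctionOutcome:
    # Hypothetical helper: the producer records the concrete class name.
    error = ErrorObject(type=type(exc).__name__, message=str(exc))
    return UserFunctionOutcome.from_error(error)


# The base class name matches, the subclass name does not.
print(report(SuspendExecution("suspend")))
print(report(TimedSuspendExecution("timed wait")))
```

A string comparison on the class name loses the inheritance relationship that an `except SuspendExecution` clause would respect, which is why only the subclass case was misclassified.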

Change

  • Add an internal PluginExecutor.on_user_function_suspend(start_info) that reports a PENDING outcome with no error, dispatched through the existing public on_user_function_end.
  • Route the except SuspendExecution branch in wrap_user_function to it instead of fabricating an error.
  • Simplify UserFunctionOutcome.from_error to map None to SUCCEEDED and any other error to FAILED, removing the fragile name match.
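The shape of the change can be sketched as follows. This is an illustration under stated assumptions rather than the actual diff: the classes are reduced stand-ins, `start_info` is treated as an opaque object, the error is recorded as a plain string, and re-raising after notifying plugins is assumed, not confirmed by this description.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable


class SuspendExecution(Exception):
    """Stand-in for the SDK's base suspend signal."""


class TimedSuspendExecution(SuspendExecution):
    """Stand-in for the subclass that timed waits raise."""


class UserFunctionOutcome(Enum):
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    PENDING = "PENDING"

    @classmethod
    def from_error(cls, error: object | None) -> UserFunctionOutcome:
        # Simplified mapping: no name matching, suspends never reach here.
        return cls.SUCCEEDED if error is None else cls.FAILED


@dataclass
class UserFunctionEndInfo:
    # Reduced stand-in; the real type carries more fields.
    outcome: UserFunctionOutcome
    error: str | None = None


class PluginExecutor:
    def __init__(self) -> None:
        self.events: list[UserFunctionEndInfo] = []

    def on_user_function_end(self, info: UserFunctionEndInfo) -> None:
        # Stand-in for dispatching to registered plugins.
        self.events.append(info)

    def on_user_function_suspend(self, start_info: object) -> None:
        # Suspend path: PENDING outcome, no error attached.
        self.on_user_function_end(
            UserFunctionEndInfo(outcome=UserFunctionOutcome.PENDING)
        )


def wrap_user_function(
    func: Callable[..., Any], executor: PluginExecutor, start_info: object = None
) -> Callable[..., Any]:
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        try:
            result = func(*args, **kwargs)
        except SuspendExecution:
            # Catches subclasses too, so TimedSuspendExecution lands here.
            executor.on_user_function_suspend(start_info)
            raise  # assumption: the suspend still propagates to the caller
        except Exception as exc:
            name = type(exc).__name__
            executor.on_user_function_end(
                UserFunctionEndInfo(UserFunctionOutcome.from_error(name), name)
            )
            raise
        executor.on_user_function_end(
            UserFunctionEndInfo(UserFunctionOutcome.from_error(None))
        )
        return result

    return wrapped
```

Because the suspend is classified by the `except` clause rather than by a string, any future subclass of SuspendExecution would take the same PENDING path without further changes.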

This is an internal change. The public plugin interface already exposes UserFunctionEndInfo.outcome and PENDING, so there is no public API change.

Testing

  • New regression test test_wrap_user_function_suspend_reports_pending_outcome fails on main (FAILED vs PENDING) and passes with this change.
  • New unit tests cover from_start_info_suspended, on_user_function_suspend, and the simplified from_error.
  • Core SDK suite: 1276 passed. OTEL package suite: 48 passed. Typecheck clean. hatch fmt --check clean.

A user function that suspends (for example a child context that waits)
was reported to instrumentation plugins as a FAILED outcome. The OTEL
plugin then recorded the suspend as a span error, which X-Ray surfaced
as a fault on the child-context span.

The producer emitted the concrete exception name (TimedSuspendExecution),
but UserFunctionOutcome.from_error matched only the base class name
(SuspendExecution), so the suspend fell through to FAILED instead of
PENDING.

Route suspends through a dedicated PluginExecutor.on_user_function_suspend
that reports a PENDING outcome with no error. Simplify from_error to map
None to SUCCEEDED and any other error to FAILED. This is an internal
change with no public plugin API impact.

Closes #491
@yaythomas added the otel-plugin label (related to the otel-plugin package) on Jun 25, 2026
Add 'from __future__ import annotations' and drop the forward-reference
string quotes on the self-referential return types, matching the rest of
the package.
@github-project-automation (bot) moved this from In review to Pending merge in aws-durable-execution on Jun 25, 2026
start_time=start_info.start_time,
is_replay_children=start_info.is_replay_children,
attempt=start_info.attempt,
outcome=UserFunctionOutcome.PENDING,

Contributor

On SuspendExecution, the SDK calls CheckpointDurableExecution with STARTED, so we might want to keep this as STARTED; the completed outcome would then be emitted later, from the list of operations returned by the backend.

Development

Successfully merging this pull request may close these issues.

OTEL plugin marks child-context spans as faults when they suspend (timed wait misclassified as FAILED)

3 participants