fix: report suspends as PENDING to plugins #492
Open
yaythomas wants to merge 2 commits into
A user function that suspends (for example a child context that waits) was reported to instrumentation plugins as a FAILED outcome. The OTEL plugin then recorded the suspend as a span error, which X-Ray surfaced as a fault on the child-context span.

The producer emitted the concrete exception name (TimedSuspendExecution), but UserFunctionOutcome.from_error matched only the base class name (SuspendExecution), so the suspend fell through to FAILED instead of PENDING.

Route suspends through a dedicated PluginExecutor.on_user_function_suspend that reports a PENDING outcome with no error. Simplify from_error to map None to SUCCEEDED and any other error to FAILED. This is an internal change with no public plugin API impact.

Closes #491
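The name mismatch described in this commit message can be reproduced in isolation. The sketch below is not the SDK's code: `Outcome`, `outcome_from_error_name`, `report`, and `timed_wait` are hypothetical stand-ins, and the two exception classes are local placeholders that only mirror the class names and the subclass relationship mentioned above.

```python
from enum import Enum
from typing import Callable, Optional


class SuspendExecution(Exception):
    """Local placeholder for the base suspend exception (assumption)."""


class TimedSuspendExecution(SuspendExecution):
    """Local placeholder for the concrete exception a timed wait raises."""


class Outcome(Enum):
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    PENDING = "PENDING"


def outcome_from_error_name(error_name: Optional[str]) -> Outcome:
    """Hypothetical pre-fix mapping that matches only the base class name."""
    if error_name is None:
        return Outcome.SUCCEEDED
    if error_name == "SuspendExecution":
        return Outcome.PENDING
    return Outcome.FAILED


def report(fn: Callable[[], object]) -> Outcome:
    """Hypothetical producer that reports the concrete exception class name."""
    try:
        fn()
    except Exception as exc:
        # type(exc).__name__ is the subclass name, not the base class name.
        return outcome_from_error_name(type(exc).__name__)
    return outcome_from_error_name(None)


def timed_wait() -> None:
    raise TimedSuspendExecution("suspended until the timer fires")
```

With these stand-ins, `report(timed_wait)` yields `Outcome.FAILED` because the string `"TimedSuspendExecution"` never equals `"SuspendExecution"`, even though the exception is a subclass of it.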
Add 'from __future__ import annotations' and drop the forward-reference string quotes on the self-referential return types, matching the rest of the package.
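As a small illustration of what this commit does, the sketch below shows a self-referential return type written without string quotes once `from __future__ import annotations` is in effect. `EndInfo` and `pending` are hypothetical names chosen for the example, not types from the package.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EndInfo:
    """Hypothetical info type with a factory that returns its own class."""

    outcome: str

    @classmethod
    def pending(cls) -> EndInfo:
        # Without the __future__ import this annotation would need to be
        # written as "EndInfo", because the class name is not yet bound
        # while the class body is being executed.
        return cls(outcome="PENDING")
```

With the import in place, annotations are stored as strings and not evaluated at definition time, so the bare class name is safe here.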
SilanHe
approved these changes
Jun 25, 2026
    start_time=start_info.start_time,
    is_replay_children=start_info.is_replay_children,
    attempt=start_info.attempt,
    outcome=UserFunctionOutcome.PENDING,
Contributor
On SuspendExecution, the SDK calls CheckpointDurableExecution with STARTED, so we might want to keep this as STARTED; the completed status would then be emitted later from the list of operations returned by the backend.
Closes #491
Problem
In the experimental OpenTelemetry plugin, a durable execution that suspends inside a child context (for example a child context that calls `context.wait(...)`) produced a false fault on the `child-context` span in X-Ray, with a recorded exception named `TimedSuspendExecution`. The execution behaved correctly. A suspend is normal durable control flow, not a failure.

Root cause
The producer in `state.py` `wrap_user_function` reported a suspend to instrumentation plugins using the concrete exception name (`TimedSuspendExecution`). `UserFunctionOutcome.from_error` matched only the base class name (`SuspendExecution`), so the suspend fell through to `FAILED` instead of `PENDING`. Since `TimedSuspendExecution` is the type every timed wait raises, the PENDING branch was never reached for that case. The OTEL plugin then rendered `FAILED` as a span error, which X-Ray shows as a fault. Top-level waits are unaffected because a top-level suspend resolves to `InvocationStatus.PENDING` at the invocation level.

Change
- Add `PluginExecutor.on_user_function_suspend(start_info)` that reports a `PENDING` outcome with no error, dispatched through the existing public `on_user_function_end`.
- Route the `except SuspendExecution` branch in `wrap_user_function` to it instead of fabricating an error.
- Simplify `UserFunctionOutcome.from_error` to map `None` to `SUCCEEDED` and any other error to `FAILED`, removing the fragile name match.

This is an internal change. The public plugin interface already exposes `UserFunctionEndInfo.outcome` and `PENDING`, so there is no public API change.

Testing
- `test_wrap_user_function_suspend_reports_pending_outcome` fails on `main` (FAILED vs PENDING) and passes with this change.
- Tests for `from_start_info_suspended`, `on_user_function_suspend`, and the simplified `from_error`.
- `hatch fmt --check` clean.
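For reviewers who want the shape of the change at a glance, here is a minimal, self-contained sketch of the routing described in this PR. It reuses the identifiers named in the description, but every body and signature below is an assumption written for illustration (for instance, `start_info` is omitted and the exception classes are local placeholders); it is not the code in the diff.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class SuspendExecution(Exception):
    """Local placeholder for the base suspend exception (assumption)."""


class TimedSuspendExecution(SuspendExecution):
    """Local placeholder for the exception a timed wait raises."""


class UserFunctionOutcome(Enum):
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    PENDING = "PENDING"

    @classmethod
    def from_error(cls, error: Optional[BaseException]) -> "UserFunctionOutcome":
        # Simplified mapping: no name matching. Suspends never take this path.
        return cls.SUCCEEDED if error is None else cls.FAILED


@dataclass
class UserFunctionEndInfo:
    outcome: UserFunctionOutcome
    error: Optional[BaseException] = None


class PluginExecutor:
    """Illustrative executor that records what plugins would be told."""

    def __init__(self) -> None:
        self.reported: List[UserFunctionEndInfo] = []

    def on_user_function_end(self, info: UserFunctionEndInfo) -> None:
        self.reported.append(info)

    def on_user_function_suspend(self) -> None:
        # A suspend is reported as PENDING with no error attached.
        self.on_user_function_end(UserFunctionEndInfo(UserFunctionOutcome.PENDING))


def wrap_user_function(fn: Callable[[], object], plugins: PluginExecutor) -> Callable[[], object]:
    def wrapped() -> object:
        try:
            result = fn()
        except SuspendExecution:
            plugins.on_user_function_suspend()
            raise  # assumed: the suspend still propagates to the caller
        except Exception as exc:
            outcome = UserFunctionOutcome.from_error(exc)
            plugins.on_user_function_end(UserFunctionEndInfo(outcome, exc))
            raise
        plugins.on_user_function_end(
            UserFunctionEndInfo(UserFunctionOutcome.from_error(None))
        )
        return result

    return wrapped
```

Because the `except SuspendExecution` branch relies on an `isinstance`-style match, any subclass such as `TimedSuspendExecution` is reported as `PENDING` without depending on the class name.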