
Define formal execution state machine with validated transitions #1217

Description

@geoffjay

Context

Part of #1216 (track execution state of agents). This is the foundation issue that defines the formal state model and transition rules.

Background

The orchestrator currently has two separate state concepts with no formal relationship:

  1. AgentStatus (persisted to DB): Pending, Running, Stopped, Failed
  2. ActivityState (in-memory only): Idle, Busy

Transitions between these states happen ad hoc throughout manager.rs and websocket.rs, with no validation: any code path can set any status. Nothing prevents, for example, a Stopped agent from transitioning directly to Busy.

What to Build

1. Unified Execution State Model

Define a richer ExecutionState enum that unifies lifecycle and activity into a single state machine:

use serde::{Deserialize, Serialize};

/// The execution state of an agent, combining lifecycle and activity.
///
/// This is the single source of truth for "what is this agent doing right now?"
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum ExecutionState {
    /// Agent record created, backend session not yet started.
    Pending,
    /// Backend session created, waiting for Claude to connect.
    Starting,
    /// Agent connected and waiting for input.
    Idle,
    /// Agent actively processing a prompt.
    Busy,
    /// Agent temporarily suspended (e.g., context clear in progress).
    Suspended,
    /// Agent exited cleanly.
    Stopped,
    /// Agent process failed or crashed.
    Failed,
}
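The acceptance criteria call for Display and FromStr alongside the serde derives. A minimal sketch, assuming the text form should match the snake_case names that `#[serde(rename_all = "snake_case")]` produces, so API strings, logs, and stored values agree. The `as_str` helper and the plain `String` error type are assumptions, and the serde derives are left out so the sketch compiles with std alone.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionState {
    Pending,
    Starting,
    Idle,
    Busy,
    Suspended,
    Stopped,
    Failed,
}

impl ExecutionState {
    /// Hypothetical helper: canonical snake_case name, matching serde's rename_all output.
    pub fn as_str(&self) -> &'static str {
        match self {
            ExecutionState::Pending => "pending",
            ExecutionState::Starting => "starting",
            ExecutionState::Idle => "idle",
            ExecutionState::Busy => "busy",
            ExecutionState::Suspended => "suspended",
            ExecutionState::Stopped => "stopped",
            ExecutionState::Failed => "failed",
        }
    }
}

impl fmt::Display for ExecutionState {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}

impl FromStr for ExecutionState {
    // Assumption: a String error is enough for a sketch; real code might use anyhow::Error.
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "pending" => Ok(ExecutionState::Pending),
            "starting" => Ok(ExecutionState::Starting),
            "idle" => Ok(ExecutionState::Idle),
            "busy" => Ok(ExecutionState::Busy),
            "suspended" => Ok(ExecutionState::Suspended),
            "stopped" => Ok(ExecutionState::Stopped),
            "failed" => Ok(ExecutionState::Failed),
            other => Err(format!("unknown execution state: {other}")),
        }
    }
}
```

Keeping Display and FromStr on the same string table means every state round-trips through its text form, which is what a column stored as text would rely on.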

2. Valid Transition Map

Define which transitions are legal:

Pending   -> Starting, Failed
Starting  -> Idle, Failed
Idle      -> Busy, Suspended, Stopped, Failed
Busy      -> Idle, Suspended, Stopped, Failed
Suspended -> Idle, Busy, Stopped, Failed
Stopped   -> Starting (restart)
Failed    -> Starting (restart)
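The transition map can also be kept as data, one row per source state, which makes it easy to render in docs and to iterate over in tests. A sketch assuming a hypothetical `allowed_targets()` helper and `ALL` constant (neither is named in this issue) and a `Copy` derive on the enum:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionState {
    Pending,
    Starting,
    Idle,
    Busy,
    Suspended,
    Stopped,
    Failed,
}

impl ExecutionState {
    /// Every state, in declaration order (hypothetical helper).
    pub const ALL: [ExecutionState; 7] = [
        ExecutionState::Pending,
        ExecutionState::Starting,
        ExecutionState::Idle,
        ExecutionState::Busy,
        ExecutionState::Suspended,
        ExecutionState::Stopped,
        ExecutionState::Failed,
    ];

    /// The transition map above, one row per source state (hypothetical helper).
    pub fn allowed_targets(&self) -> &'static [ExecutionState] {
        use ExecutionState::*;
        match self {
            Pending => &[Starting, Failed],
            Starting => &[Idle, Failed],
            Idle => &[Busy, Suspended, Stopped, Failed],
            Busy => &[Idle, Suspended, Stopped, Failed],
            Suspended => &[Idle, Busy, Stopped, Failed],
            Stopped | Failed => &[Starting], // restart
        }
    }

    /// A transition is valid exactly when the target appears in the source's row.
    pub fn can_transition_to(&self, target: &ExecutionState) -> bool {
        self.allowed_targets().contains(target)
    }
}
```

The map allows 18 of the 49 ordered pairs. It has no self-loops, so under this sketch a repeated event such as Idle -> Idle would be rejected rather than treated as a no-op; whether that is desired is worth settling before implementation.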

3. Transition Function

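use anyhow::{bail, Result}; // assumes types.rs already depends on anyhow

impl ExecutionState {
    /// Attempt a state transition. Returns Ok(new_state) if valid,
    /// Err with context if the transition is illegal.
    /// The `{}` formatting relies on the Display impl listed in the acceptance criteria.
    pub fn transition_to(&self, target: ExecutionState) -> Result<ExecutionState> {
        if self.can_transition_to(&target) {
            Ok(target)
        } else {
            bail!("Invalid state transition: {} -> {}", self, target)
        }
    }

    /// Check whether a transition to `target` is valid, per the transition map.
    pub fn can_transition_to(&self, target: &ExecutionState) -> bool {
        use ExecutionState::*;
        matches!(
            (self, target),
            (Pending, Starting | Failed)
                | (Starting, Idle | Failed)
                | (Idle, Busy | Suspended | Stopped | Failed)
                | (Busy, Idle | Suspended | Stopped | Failed)
                | (Suspended, Idle | Busy | Stopped | Failed)
                | (Stopped | Failed, Starting) // restart
        )
    }

    /// Is this a terminal state (Stopped or Failed)?
    pub fn is_terminal(&self) -> bool {
        matches!(self, ExecutionState::Stopped | ExecutionState::Failed)
    }

    /// Is the agent logically "alive" (could process work)?
    /// Assumed here to be the states that map to `running`.
    pub fn is_active(&self) -> bool {
        matches!(self, ExecutionState::Idle | ExecutionState::Busy | ExecutionState::Suspended)
    }
}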
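Since the Background notes that any code path can currently set any status, one option is to make the validated transition the only way to change an agent's state. A minimal sketch, assuming a hypothetical `AgentExecution` wrapper whose state field is private, and a hypothetical `InvalidTransition` error type used in place of anyhow so the example compiles with std alone:

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionState {
    Pending,
    Starting,
    Idle,
    Busy,
    Suspended,
    Stopped,
    Failed,
}

impl ExecutionState {
    pub fn can_transition_to(&self, target: &ExecutionState) -> bool {
        use ExecutionState::*;
        matches!(
            (self, target),
            (Pending, Starting | Failed)
                | (Starting, Idle | Failed)
                | (Idle, Busy | Suspended | Stopped | Failed)
                | (Busy, Idle | Suspended | Stopped | Failed)
                | (Suspended, Idle | Busy | Stopped | Failed)
                | (Stopped | Failed, Starting)
        )
    }
}

/// Hypothetical error for an illegal transition (stand-in for anyhow's `bail!`).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InvalidTransition {
    pub from: ExecutionState,
    pub to: ExecutionState,
}

impl fmt::Display for InvalidTransition {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Invalid state transition: {:?} -> {:?}", self.from, self.to)
    }
}

impl std::error::Error for InvalidTransition {}

/// Hypothetical wrapper: `state` is private, so code outside this module
/// cannot assign a state without going through `transition`.
pub struct AgentExecution {
    state: ExecutionState,
}

impl AgentExecution {
    pub fn new() -> Self {
        Self { state: ExecutionState::Pending }
    }

    pub fn state(&self) -> ExecutionState {
        self.state
    }

    /// Apply the transition if the map allows it; otherwise leave the state unchanged.
    pub fn transition(&mut self, to: ExecutionState) -> Result<ExecutionState, InvalidTransition> {
        if self.state.can_transition_to(&to) {
            self.state = to;
            Ok(to)
        } else {
            Err(InvalidTransition { from: self.state, to })
        }
    }
}
```

A rejected transition leaves the stored state untouched, so a failed call such as Stopped -> Busy cannot half-apply.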

4. Transition Trigger Enum

Record why a transition happened:

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum TransitionTrigger {
    Spawn,              // Agent spawned
    Connected,          // WebSocket/stdio connection established
    Disconnected,       // Connection lost
    PromptReceived,     // User message sent to agent
    ResultReceived,     // Agent returned a result
    ContextCleared,     // Context was cleared
    UserTerminated,     // Explicit stop request
    ProcessExited,      // Backend process exited
    ProcessCrashed,     // Backend process crashed
    Restart,            // Agent restart initiated
    Reconciliation,     // Startup reconciliation
}
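One way to record why each transition happened is to store the trigger next to the validated from/to pair. A sketch assuming a hypothetical `TransitionRecord` entry and `StateTracker` holder (neither is specified in this issue); the pairing of triggers to transitions in the usage is illustrative, and a real record would likely also carry a timestamp:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionState {
    Pending,
    Starting,
    Idle,
    Busy,
    Suspended,
    Stopped,
    Failed,
}

impl ExecutionState {
    pub fn can_transition_to(&self, target: &ExecutionState) -> bool {
        use ExecutionState::*;
        matches!(
            (self, target),
            (Pending, Starting | Failed)
                | (Starting, Idle | Failed)
                | (Idle, Busy | Suspended | Stopped | Failed)
                | (Busy, Idle | Suspended | Stopped | Failed)
                | (Suspended, Idle | Busy | Stopped | Failed)
                | (Stopped | Failed, Starting)
        )
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TransitionTrigger {
    Spawn,
    Connected,
    Disconnected,
    PromptReceived,
    ResultReceived,
    ContextCleared,
    UserTerminated,
    ProcessExited,
    ProcessCrashed,
    Restart,
    Reconciliation,
}

/// Hypothetical audit entry pairing a validated transition with its cause.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TransitionRecord {
    pub from: ExecutionState,
    pub to: ExecutionState,
    pub trigger: TransitionTrigger,
}

/// Hypothetical holder for the current state plus an append-only history.
pub struct StateTracker {
    current: ExecutionState,
    history: Vec<TransitionRecord>,
}

impl StateTracker {
    pub fn new(initial: ExecutionState) -> Self {
        Self { current: initial, history: Vec::new() }
    }

    pub fn current(&self) -> ExecutionState {
        self.current
    }

    pub fn history(&self) -> &[TransitionRecord] {
        &self.history
    }

    /// Validate first, then record; a rejected transition changes neither state nor history.
    pub fn apply(&mut self, to: ExecutionState, trigger: TransitionTrigger) -> Result<(), String> {
        if !self.current.can_transition_to(&to) {
            return Err(format!(
                "Invalid state transition: {:?} -> {:?} ({:?})",
                self.current, to, trigger
            ));
        }
        self.history.push(TransitionRecord { from: self.current, to, trigger });
        self.current = to;
        Ok(())
    }
}
```

Including the trigger in the error message also helps explain rejected events, for example a Restart arriving while the agent is still Idle.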

5. Backward Compatibility

The existing AgentStatus enum is used by the API, MCP tools, CLI, and UI. The new ExecutionState must map cleanly to the existing status values:

  • Pending -> pending
  • Starting -> pending (or a new value)
  • Idle / Busy / Suspended -> running
  • Stopped -> stopped
  • Failed -> failed

Provide a to_agent_status() method for backward-compatible API responses. The ActivityState value can likewise be derived with to_activity_state(): Busy -> busy, everything else -> idle.
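The mapping above fits in two small methods. A sketch in which `AgentStatus` and `ActivityState` are simplified stand-ins for the existing enums in types.rs (their real definitions may differ), and Starting maps to Pending rather than to a new status value:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionState {
    Pending,
    Starting,
    Idle,
    Busy,
    Suspended,
    Stopped,
    Failed,
}

/// Simplified stand-in for the existing persisted/API status.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentStatus {
    Pending,
    Running,
    Stopped,
    Failed,
}

/// Simplified stand-in for the existing in-memory activity state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ActivityState {
    Idle,
    Busy,
}

impl ExecutionState {
    /// Map onto the legacy status. Exhaustive on purpose: adding a state
    /// forces this mapping to be revisited at compile time.
    pub fn to_agent_status(&self) -> AgentStatus {
        match self {
            ExecutionState::Pending | ExecutionState::Starting => AgentStatus::Pending,
            ExecutionState::Idle | ExecutionState::Busy | ExecutionState::Suspended => {
                AgentStatus::Running
            }
            ExecutionState::Stopped => AgentStatus::Stopped,
            ExecutionState::Failed => AgentStatus::Failed,
        }
    }

    /// Busy -> Busy; every other state reports Idle.
    pub fn to_activity_state(&self) -> ActivityState {
        match self {
            ExecutionState::Busy => ActivityState::Busy,
            _ => ActivityState::Idle,
        }
    }
}
```

The wildcard in to_activity_state is deliberate, since the rule is literally "everything else -> idle"; to_agent_status avoids one so the compiler flags any new state that has no status.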

Acceptance Criteria

  • ExecutionState enum with 7 states defined in crates/orchestrator/src/types.rs
  • Valid transition map implemented and documented
  • transition_to() and can_transition_to() methods
  • TransitionTrigger enum for recording why transitions happen
  • to_agent_status() for backward-compatible API mapping
  • to_activity_state() for backward-compatible activity mapping
  • Display, FromStr, Serialize, Deserialize implementations
  • Unit tests for all valid transitions
  • Unit tests for all invalid transitions (rejected with error)
  • Unit tests for backward-compatibility mappings
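The "all valid" and "all invalid" test criteria can be covered by one loop over all 49 ordered pairs, checked against a table written out independently of the implementation. A sketch assuming hypothetical `ALL` and `expected_valid()` helpers, with a `String` error standing in for anyhow:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecutionState {
    Pending,
    Starting,
    Idle,
    Busy,
    Suspended,
    Stopped,
    Failed,
}

impl ExecutionState {
    pub const ALL: [ExecutionState; 7] = [
        ExecutionState::Pending,
        ExecutionState::Starting,
        ExecutionState::Idle,
        ExecutionState::Busy,
        ExecutionState::Suspended,
        ExecutionState::Stopped,
        ExecutionState::Failed,
    ];

    pub fn can_transition_to(&self, target: &ExecutionState) -> bool {
        use ExecutionState::*;
        matches!(
            (self, target),
            (Pending, Starting | Failed)
                | (Starting, Idle | Failed)
                | (Idle, Busy | Suspended | Stopped | Failed)
                | (Busy, Idle | Suspended | Stopped | Failed)
                | (Suspended, Idle | Busy | Stopped | Failed)
                | (Stopped | Failed, Starting)
        )
    }

    pub fn transition_to(&self, target: ExecutionState) -> Result<ExecutionState, String> {
        if self.can_transition_to(&target) {
            Ok(target)
        } else {
            Err(format!("Invalid state transition: {:?} -> {:?}", self, target))
        }
    }
}

/// Pair-by-pair copy of the transition map, kept separate from the match
/// so the tests do not just re-run the implementation's own logic.
pub fn expected_valid() -> Vec<(ExecutionState, ExecutionState)> {
    use ExecutionState::*;
    vec![
        (Pending, Starting), (Pending, Failed),
        (Starting, Idle), (Starting, Failed),
        (Idle, Busy), (Idle, Suspended), (Idle, Stopped), (Idle, Failed),
        (Busy, Idle), (Busy, Suspended), (Busy, Stopped), (Busy, Failed),
        (Suspended, Idle), (Suspended, Busy), (Suspended, Stopped), (Suspended, Failed),
        (Stopped, Starting),
        (Failed, Starting),
    ]
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn every_pair_matches_the_transition_map() {
        let valid = expected_valid();
        for from in ExecutionState::ALL {
            for to in ExecutionState::ALL {
                let expected = valid.contains(&(from, to));
                assert_eq!(from.can_transition_to(&to), expected, "{from:?} -> {to:?}");
                // Valid pairs must yield Ok(to); invalid pairs must be rejected with Err.
                match from.transition_to(to) {
                    Ok(next) => {
                        assert!(expected, "{from:?} -> {to:?} should be rejected");
                        assert_eq!(next, to);
                    }
                    Err(_) => assert!(!expected, "{from:?} -> {to:?} should be allowed"),
                }
            }
        }
    }
}
```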

Relevant Files

  • crates/orchestrator/src/types.rs -- ExecutionState, TransitionTrigger enums
  • crates/orchestrator/src/entity/agent.rs -- may need column type awareness
  • Existing: AgentStatus (types.rs:38-87), ActivityState (types.rs:60-73)

Stack Base

Stack on: feature/autonomous-pipeline
Parallel: no ordering constraint (this is the foundation issue)

Labels: architecture, complexity:large, enhancement, needs-tests, triaged
