feat(workflows): cloudflare workflow interpreter (dag/v1)#43
yourbuddyconner wants to merge 78 commits into
Preview deployment: https://pr-43.dev-valet-turnkey-client.pages.dev
Spec updates per PR #43 review feedback:
- Add `foreach` node. Bounded iteration over an array, inline body node of a restricted type set, configurable concurrency, on-item-error handling (fail/skip/collect), per-item step naming for retry granularity, and policy-enforced maxItems. No nested foreach.
- LLM node output contract is now JSON-only. Models are always prompted for structured output. `outputSchema` is the optional validation layer; without one, any JSON value is accepted. No `{ text: string }` fallback; author a schema with a `text` field if you want a text return. Adds a step-result size note pointing at R2 staging for oversized outputs.
- Add Cancellation Semantics section. State machine extended with a `cancelling` transient. Per-node behavior table covers in-flight tool/llm completion, pending approvals (D1 status flip + workflow termination), step.sleep abandonment, foreach mid-iteration, and most importantly the session node's spawned session (sessions.abort with a 5s deadline so the child doesn't keep running after its parent workflow is gone). Explicit list of what cancel does NOT undo (committed side effects, sent messages, fire-and-forget orchestrator dispatches).
- Add Hibernation And Long-Running Waits section. session/orchestrator until_idle waits use bounded polling with exponential backoff (max 5min intervals, capped by node timeout) and a getStatus call that hydrates the session DO without forcing a wake. "session is gone" path fails the node explicitly. Determinism rules for replay across hibernation.
- Drop `merge` node. Same data is reachable via `nodes.<parentId>.data` from any downstream node. Removed from goals, node-types section, palette defaults, inspector forms, node-card signals, and testing section.
Updated execution status transition diagram, node palette listing, palette defaults, inspector form list, and test coverage to match.
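The bounded-polling rule for until_idle waits can be sketched as a pure function that produces the sleep schedule up front, which also keeps it deterministic across replay. This is only an illustration: the spec states exponential backoff, a 5-minute maximum interval, and a cap at the node timeout; the function name `pollDelaysMs`, the 5s base interval, and the doubling factor are assumptions.

```typescript
// Hypothetical sketch. Only "exponential backoff, max 5min intervals, capped
// by node timeout" comes from the spec; the 5s base and the doubling factor
// are assumptions chosen for illustration.
const MAX_INTERVAL_MS = 5 * 60 * 1000;

function pollDelaysMs(nodeTimeoutMs: number, baseMs = 5_000): number[] {
  if (baseMs <= 0) throw new Error("baseMs must be positive");
  const delays: number[] = [];
  let elapsed = 0;
  let next = baseMs;
  while (elapsed < nodeTimeoutMs) {
    // Never sleep longer than 5 minutes, and never past the node timeout.
    const delay = Math.min(next, MAX_INTERVAL_MS, nodeTimeoutMs - elapsed);
    delays.push(delay);
    elapsed += delay;
    next = Math.min(next * 2, MAX_INTERVAL_MS);
  }
  return delays;
}
```

Because the schedule depends only on its inputs, a replayed run would compute the same sequence of sleeps as the original run.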
Replace the runner-driven workflow runtime with a Cloudflare Workflow entrypoint (ValetWorkflowInterpreter) interpreting dag/v1 definitions. Author-facing model: draft → validate → publish → version history; runtime: a wave-loop DAG of typed nodes (llm / tool / set / if / wait / approval / foreach / orchestrator / session / stop) executing inside step.do with cancellation, hibernation, and approval gates handled natively.
Highlights
- ValetWorkflowInterpreter + step.do wave loop with deterministic per-node tracing in workflow_execution_nodes.
- Draft + workflow_definition_versions lifecycle; routes for sync, draft CRUD, validate, publish, test-run, restore.
- dag/v1 validator (Zod + semantic) + env-dependent validation at publish + execution-create.
- Approvals via workflow_approvals + step.waitForEvent + instance.sendEvent; flat + nested approve/deny endpoints; stuck-approval sweep filters on active execution status.
- Cancellation pipeline: cancel API → cancelling CAS → terminate() → cleanup gated on cleanup_completed_at; wave loop probes cancel state per iteration so in-flight cancels don't dispatch new side-effects; sweep retries terminate() with status() guard for already-terminal instances; cron sweep covers both cancelling and cancelled-without-cleanup rows.
- Tool node: action_invocations enter pending; markExecuted / markFailed CAS-guard on pending/approved so cancel-set 'failed' isn't clobbered.
- Webhook triggers gate on a server-issued one-time token (X-Valet-Trigger-Token); CORS allow list updated; rate-limit per trigger; trigger UI surfaces token once at create and preserves config on PATCH.
- Per-user + global execution concurrency caps enforced in createExecution; test-run goes through the same gate.
- Client: workflow detail with executions + pending-approval panel on any active row; executions list with inline approval action; trigger CRUD with one-shot token reveal; workflow Published vs Draft badge driven by published_version_id.
- Remove runner-side workflow engine + workflow-cli + gateway routes and dependents; drop legacy workflow_executions columns + the workflow_execution_steps table; drop dead lib/db/executions helpers and reconcileWorkflowExecutions; remove resumeToken / variables mirror; reject legacy repoUrl/branch/ref/sourceRepoFullName on manual + trigger run.
- Docs: docs/specs/workflows.md authoritative; architecture.md, real-time.md, sandbox-images.md, runner-and-opencode.mdx, workflow-execution.mdx updated to dag/v1 present tense; product docs (built-in-tools.mdx, plugin-workflows skill, opencode.json instruction) point at the web UI for workflow management.
Tests: 951 passing across 75 files; major adds: wave-loop cancel race (before-startup and in-flight), audit-row ordering, sweep retry, foreach body edge rejection, schema legacy-field rejection, approval idempotency, parallel-sibling approval race.
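The CAS guard on markExecuted / markFailed described above can be illustrated with an in-memory stand-in for action_invocations. This is a sketch under assumptions: the `InvocationStore` class and its `cas` method are hypothetical, and the real code presumably expresses the same guard as a conditional UPDATE against D1.

```typescript
type InvocationStatus = "pending" | "approved" | "executed" | "failed";

// Hypothetical in-memory stand-in for the action_invocations table.
class InvocationStore {
  private rows: Record<string, InvocationStatus> = {};

  insertPending(id: string): void {
    this.rows[id] = "pending";
  }

  status(id: string): InvocationStatus | undefined {
    return this.rows[id];
  }

  // Compare-and-set: transition only if the current status is in `from`.
  // Returns true when the row was updated.
  cas(id: string, from: InvocationStatus[], to: InvocationStatus): boolean {
    const current = this.rows[id];
    if (current === undefined || from.indexOf(current) === -1) return false;
    this.rows[id] = to;
    return true;
  }
}

// Only rows that are still open may be finalized.
const OPEN_STATES: InvocationStatus[] = ["pending", "approved"];

function markExecuted(store: InvocationStore, id: string): boolean {
  return store.cas(id, OPEN_STATES, "executed");
}

function markFailed(store: InvocationStore, id: string): boolean {
  return store.cas(id, OPEN_STATES, "failed");
}
```

With this shape, a cancel that sets `failed` first wins: a late markExecuted from the in-flight tool call returns false and leaves the row untouched.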
659c6f6 to 7795cda
Reconcile with main's per-channel dispatch + orchestrator thread-origin
grouping work. Conflicts:
- backend/images/base.py: bump IMAGE_BUILD_VERSION combining both changes.
- packages/runner/src/agent-client.ts: keep main's per-channel sendAborted
signature; drop sendWorkflowExecutionResult (runner workflow runtime
removed on this branch).
- packages/worker/src/durable-objects/prompt-queue.{ts,test.ts}: take main's
version. workflow_execute queue type becomes dead code on this branch
(workflows go through env.WORKFLOW_INTERPRETER.create) but is harmless;
per-channel dispatch + replaceable column are real new functionality.
- packages/worker/src/durable-objects/session-agent.ts: take main's then
strip the workflow runtime call sites (WorkflowExecutionDispatchPayload
import, session-workflows.js imports, workflow-* handlers,
/workflow-execute endpoint, handleWorkflowExecuteDispatch, queue-drain
workflow_execute branches, workflow-execution-result handler).
- packages/worker/src/services/session-workflows.ts: stays deleted.
Also:
- Add workerOrigin param (unused) to runTrigger + runWorkflowManually so
main's new signature matches; workflow dispatch on this branch goes
through WORKFLOW_INTERPRETER.create and doesn't need the worker URL.
- Fix triggers.test.ts mock to target updateTriggerLastRunUnchecked
(this branch's variant) instead of updateTriggerLastRun.
- Drop the workflow_execute exclusivity test from session-agent.test.ts
— the queue path no longer exists.
Tests: 981 passing across 78 files.
A trigger created with the token model but no legacy config.secret was still reachable via /webhooks/<path> with zero auth: the path-based handler only required config.secret to be set, and tokenized triggers have no reason to set one.
Refuse the path-based route entirely when the trigger row has a webhook_token. Returning 404 (not 401) keeps the trigger's existence opaque to probes. The /api/triggers/:id/webhook token URL is the only supported entry once a token has been minted; migration 0020 backfilled tokens for every existing webhook trigger, so this closes the path-based route for them all by design (operators surface the token out-of-band per the migration comment).
Also:
- Surface /api/triggers/:id/webhook in the trigger UI's describeTrigger helper instead of /webhooks/<path>.
- Guard 0019 against malformed historical workflow JSON with json_valid(data) so the migration drops bad rows instead of aborting on a json_extract parse error.
Tests: 984 passing (added regression for the path-based-bypass + a sanity check that the secret-check path still applies for legacy rows).
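The refusal rule for the path-based route can be sketched as a small decision function. The `TriggerRow` shape and `pathRouteStatus` name are hypothetical, and the 401 and 200 outcomes for legacy rows are assumptions; the commit message only confirms the 404 for tokenized triggers and that the secret check still applies to legacy rows.

```typescript
// Hypothetical row shape; field names are illustrative, not the real schema.
interface TriggerRow {
  webhookToken: string | null; // set once a token has been minted
  secret: string | null;       // legacy config.secret
}

// Returns the HTTP status the path-based route (/webhooks/<path>) would send.
function pathRouteStatus(
  trigger: TriggerRow | undefined,
  presentedSecret: string | null,
): number {
  // Unknown path.
  if (trigger === undefined) return 404;
  // Tokenized trigger: same 404 as an unknown path, so probes learn nothing.
  if (trigger.webhookToken !== null) return 404;
  // Legacy row: the secret check still applies when a secret is configured.
  // (Assumption: 401 on mismatch. Real code would also want a constant-time compare.)
  if (trigger.secret !== null && presentedSecret !== trigger.secret) return 401;
  return 200;
}
```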
Collapse 10 incremental migrations (0018_workflow_dag_v1 through 0027_drop_legacy_workflow_execution_shape) into a single 0020_workflows_dag_v1 that establishes the end-state directly. None of these had been applied to dev or prod, so there's no migration history to coordinate.
- workflow_executions: created in final shape; eight legacy columns never exist
- workflow_spawned_sessions: created with expires_at from the start
- action_invocations: still rebuilt to preserve real session audit data across the session_id NOT NULL relaxation
- Trigger webhook_token, rate-limit table, path uniqueness, and the workflow-delete RESTRICT trigger all fold in
…s NO_RETRY
Restore the public delete contract: deleteWorkflow no longer refuses a
workflow that has published versions. The underlying FK in
workflow_definition_versions (now in 0020_workflows_dag_v1) is
ON DELETE CASCADE, so deleting a workflow drops its version rows.
Execution history still survives via workflow_executions.workflow_id
ON DELETE SET NULL + per-row definition_snapshot.
deleteWorkflowByIdOrSlug now `.run()`s the delete so the changes count
is portable across D1 and better-sqlite3.
Foreach body llm matches the top-level runtime's NO_RETRY policy.
Top-level llm nodes opt in to NO_RETRY in runtime.ts so a transient
model error doesn't duplicate billed work via CF's default 5-retry
policy. The foreach wrapper was calling step.do without retry config
and defaulting to 5 retries for llm bodies — same hazard. Pass
{ retries: NO_RETRY } only when the body is llm.
Cherry-picked from local f563ec49; the original commit's
0028_workflow_delete_cascades_versions.sql migration is dropped here
because the consolidated 0020_workflows_dag_v1 never added the
RESTRICT trigger it would have removed.
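The foreach retry rule can be sketched as a function that picks the step config from the body type. The types below are local stand-ins modeled on the `{ retries: { limit, delay } }` shape of a Cloudflare Workflows step config; the concrete value of `NO_RETRY` and the `BodyType` members are assumptions, since the commit message only names the constant.

```typescript
// Local stand-ins; not imported from any SDK.
interface RetryConfig {
  limit: number;
  delay: number; // milliseconds
}

interface StepConfig {
  retries?: RetryConfig;
}

// Assumption: NO_RETRY means "zero retries". The source only names the constant.
const NO_RETRY: RetryConfig = { limit: 0, delay: 0 };

// Hypothetical subset of body types a foreach may wrap.
type BodyType = "llm" | "tool" | "set";

// Only llm bodies opt out of retries, so a transient model error does not
// repeat billed work; other body types keep the platform default.
function stepConfigForForeachBody(bodyType: BodyType): StepConfig | undefined {
  return bodyType === "llm" ? { retries: NO_RETRY } : undefined;
}
```

Returning `undefined` for non-llm bodies mirrors "pass the config only when the body is llm": the caller can forward the value as-is and leave default retry behavior in place otherwise.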
- backend/images/base.py: bump IMAGE_BUILD_VERSION to 2026-06-22-v54-workflows-dag-v1-merge (clears both sides)
- pnpm-lock.yaml: regenerated against merged package.json set
- content-registry.ts: regenerated via make generate-registries to pick up main's Slack usergroup skill updates
- Create dialog: name auto-populates the slug (kebab-case) until the
user manually edits it.
- Anchor handles bumped from React Flow's ~7px default to 14px for a
bigger hit target when dragging connections. Applies to both the
editor and the execution viewer.
- Node inspector header replaces the bare "Selected node" label with a
type-aware heading + description and explains what the node ID is
for (template references via ${nodes.<id>.output}).
- Trigger pane copy rewritten for plain language; "Workflow
entrypoint" → "How the workflow starts", "Trigger data schema" →
"Expected input fields", and the summary swaps "Runtime trigger
payload and declared parameters" for "Where the workflow starts and
what data it receives".
Names are descriptive, so a duplicate is a yellow warning ("you can
still proceed; the slug is what uniquely identifies it"). Slugs are
the real identifier — the DB enforces UNIQUE(user_id, slug), so a
client-side duplicate slug is a red error and disables submit.
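A minimal sketch of that name-versus-slug rule, assuming exact string comparison and a hypothetical `checkDuplicates` helper (the real create dialog may normalize or compare differently):

```typescript
interface WorkflowSummary {
  name: string;
  slug: string;
}

interface DuplicateCheck {
  nameWarning: boolean; // yellow hint, submit still allowed
  slugError: boolean;   // red error, submit disabled
  canSubmit: boolean;
}

// Hypothetical helper: duplicate names only warn, duplicate slugs block,
// because the DB enforces UNIQUE(user_id, slug).
function checkDuplicates(
  existing: WorkflowSummary[],
  draft: WorkflowSummary,
): DuplicateCheck {
  const nameWarning = existing.some((w) => w.name === draft.name);
  const slugError = existing.some((w) => w.slug === draft.slug);
  return { nameWarning, slugError, canSubmit: !slugError };
}
```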
- Auto-collapse the picker when the inspector opens. Picking a node on the canvas while the picker was open used to leave both panels visible and overlapping.
- New nodes land adjacent to the anchor instead of the viewport center: the picker-open click stashes the prior selection, and handleAddNode places the new node one column-step to the right at the same y. With no selection, it places to the right of the rightmost non-trigger node. Collisions nudge down by 140 units until clear.
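The placement rule can be sketched as a pure function over node positions. Several details here are assumptions: the 340 column step is taken from the LAYOUT_COLUMN_GAP value mentioned elsewhere in this PR, the collision test (overlap within one column gap horizontally and one nudge step vertically) is invented for illustration, and the empty-canvas fallback to the origin is hypothetical.

```typescript
const LAYOUT_COLUMN_GAP = 340; // column step (value from the later cleanup commit)
const COLLISION_NUDGE_Y = 140; // nudge distance stated in the commit message

interface Point {
  x: number;
  y: number;
}

interface CanvasNode {
  id: string;
  type: string;
  position: Point;
}

// Assumption: two nodes collide when they are closer than one column gap
// horizontally and one nudge step vertically.
function collides(a: Point, b: Point): boolean {
  return (
    Math.abs(a.x - b.x) < LAYOUT_COLUMN_GAP &&
    Math.abs(a.y - b.y) < COLLISION_NUDGE_Y
  );
}

// Anchor is the prior selection, else the rightmost non-trigger node.
function pickAnchor(nodes: CanvasNode[], selectedId?: string): CanvasNode | undefined {
  const selected = nodes.filter((n) => n.id === selectedId)[0];
  if (selected) return selected;
  let rightmost: CanvasNode | undefined;
  for (const n of nodes) {
    if (n.type === "trigger") continue;
    if (!rightmost || n.position.x > rightmost.position.x) rightmost = n;
  }
  return rightmost;
}

function placeNewNode(nodes: CanvasNode[], selectedId?: string): Point {
  const anchor = pickAnchor(nodes, selectedId);
  if (!anchor) return { x: 0, y: 0 }; // hypothetical empty-canvas fallback
  const candidate: Point = {
    x: anchor.position.x + LAYOUT_COLUMN_GAP,
    y: anchor.position.y,
  };
  // Nudge down until no existing node overlaps the candidate slot.
  while (nodes.some((n) => collides(n.position, candidate))) {
    candidate.y += COLLISION_NUDGE_Y;
  }
  return candidate;
}
```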
Adds a small plus affordance on every unconnected source handle. For default sources it sits at the right edge; for if-node branches it sits past the true/false label. Clicking opens the existing node palette in wire-on-add mode: the chosen type lands adjacent to the source (via the existing placement helper) and a typed edge is drawn from the captured source handle to the new node. The fresh edge runs through applyDefaultDataFlowForConnection so it inherits the same template-source defaults as a user-dragged connection.
Correctness:
- Clear wireOnAddSource + lastSelectedBeforePicker whenever the picker
closes (Effect on nodePaletteOpen). Previously dismissing the picker
via Escape, X, toolbar toggle, or auto-collapse left the wire ref
dangling, so the NEXT unrelated add would silently create an edge
from the long-gone source.
- handleAddNode uses functional setNodes/setEdges so a concurrent state
update can't drop the append. Both wire and non-wire branches
compute id/position from currentNodes inside the updater.
- Schedule fitView({nodes:[{id}]}) on every add so a node placed past
the current viewport doesn't look like the click did nothing.
- Create-workflow Submit gates on workflowsQuery.isLoading. Previously,
if the cached list hadn't loaded yet, the duplicate-slug check
returned false (empty array) and the user could submit a server-rejected
duplicate without ever seeing the in-form hint.
Cleanup:
- Export LAYOUT_COLUMN_GAP from workflow-editor-model; visual editor
drops the 320 magic-number that contradicted the actual 340 gap.
- Replace normalizeSlug with the existing slugify in @/lib/format.
- Extract createWorkflowFlowEdge — handleConnect and the wire-on-add
path now share the edge-shape contract.
- Export getSourceHandleTopPercent from ai-elements/node; the plus
button reads the same handle positions ai-elements uses to render.
Restructure packages/shared/src/types/workflow-dag.ts → workflow-dag/ directory. Each of the 11 node types now owns its own file under nodes/<type>.ts with the interface, default factory, and a NodeDocs entry: one place to read the contract, the runtime shape, and the author-facing prose. The aggregation in nodes/index.ts builds the discriminated union and the NODE_DOCS / createDefaultWorkflowNode registries. Record<DagNodeType, NodeDocs> means a new node type without docs is a TypeScript error.
NodeDocs holds label + description + longDescription + sparse field helps + optional gotchas. Author-facing content lives next to the type definition; field helps are SPARSE: only the fields that need clarification get an entry. The inspector renders an info icon only where an entry exists.
First-pass content written for all 11 types (longDescription, the specific field tooltips that aren't self-evident, and gotchas where runtime behavior would surprise a reader). Expect copy edits.
workflow-editor-model.ts derives NODE_LABELS and NODE_DESCRIPTIONS from NODE_DOCS, and getDefaultNodeForType delegates to the new registry's createDefaultWorkflowNode.
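A trimmed sketch of how a `Record<DagNodeType, NodeDocs>` registry makes missing docs a compile error. The three-member union, the doc strings, and the `fieldHelp` helper are placeholders for illustration; only the NodeDocs field names follow the description above.

```typescript
// Trimmed to three types for brevity; the PR describes 11.
type DagNodeType = "llm" | "tool" | "stop";

interface FieldDoc {
  help: string;
}

interface NodeDocs {
  label: string;
  description: string;
  longDescription: string;
  // Sparse: only fields that need clarification get an entry.
  fields?: { [field: string]: FieldDoc | undefined };
  gotchas?: string[];
}

// Adding a member to DagNodeType without a matching key here fails to compile,
// because Record<DagNodeType, NodeDocs> requires every key to be present.
const NODE_DOCS: Record<DagNodeType, NodeDocs> = {
  llm: {
    label: "LLM",
    description: "Placeholder description.",
    longDescription: "Placeholder long-form prose.",
    fields: { model: { help: "Placeholder help text for the model field." } },
  },
  tool: {
    label: "Tool",
    description: "Placeholder description.",
    longDescription: "Placeholder long-form prose.",
  },
  stop: {
    label: "Stop",
    description: "Placeholder description.",
    longDescription: "Placeholder long-form prose.",
    gotchas: ["Placeholder gotcha."],
  },
};

// Hypothetical lookup: undefined means "render no info icon".
function fieldHelp(type: DagNodeType, field: string): string | undefined {
  return NODE_DOCS[type].fields?.[field]?.help;
}
```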
Adds packages/client/src/components/ui/info-tooltip.tsx — a small
button that renders an "i" badge next to a field label and a Radix
tooltip on hover or focus. The component returns null when no help is
passed, so call sites can drive it unconditionally from NODE_DOCS:
<Field label="Model" help={NODE_DOCS.llm.fields?.model?.help} />
The icon only renders for fields that actually have an entry in the
sparse fields registry — labels without docs stay clean.
Threaded the `help?` prop through the per-type inspector wrappers
(Field, NumberField, SelectField, KeyValueEditor, JsonValueField,
WorkflowSchemaFields, RepoFields, WaitPolicyFields, ForeachBodyField,
and the tool/service pickers) and wired NODE_DOCS lookups at every
field call site across the 11 TypeFields components.
Adds a left-side docs drawer driven entirely by NODE_DOCS in @valet/shared. Toggled from a new "i" button in the top-right toolbar (next to JSON / +). The drawer renders every node type with its description, gotchas (amber callout), longDescription (markdown via the existing MarkdownContent renderer), and a field reference of every entry in NodeDocs.fields. A search input filters by type, label, description, long-form prose, field name, field help, or gotcha text. When opened with a node selected, the drawer scrolls to and highlights that type — clicking the "i" while inspecting an LLM node lands directly on the LLM section.
Correctness:
- InfoTooltip click handler now stopPropagation — these buttons live
inside <label> wrappers (LabelText, CheckboxField), and bubbled
clicks were activating the labeled checkbox or focusing the
labeled input on every help-icon click.
- NodeDocsDrawer search no longer waits for SearchInput's default
300ms debounce. In-memory filter over 11 entries felt broken.
- Drawer renders longDescription through DeferredMarkdownContent so
the markdown bundle (react-markdown + remark + rehype-sanitize +
shiki + mermaid) doesn't load eagerly on the editor route.
Altitude:
- NodeDocs is now generic in TNode. fields is keyof-checked so
renaming an interface field or typo'ing a docs key fails the
build at the docs entry. Hardest piece was SessionNode (a
discriminated union) — distributive `keyof` would compute the
right keys but TypeScript re-distributed the surrounding record
type, producing a union of records instead of one. Switched to
`keyof UnionToIntersection<T>` which flattens the union into
an intersection so keyof yields a non-deferred string union.
- Per-node docs entries are now typed `NodeDocs<TheirNode>`.
- Drawer holds the bare `NodeDocs` alias and casts at the two
Object.entries iteration sites to a permissive record shape.
Cleanup:
- session.fields gains a prompt entry (parity with orchestrator).
- NODE_DEFAULT_FACTORIES.session drops the no-op `as` cast that
CLAUDE.md rule 3 forbids; SessionNode and
Extract<WorkflowNode,{type:'session'}> are exactly equal.
- getDefaultNodeForType deleted; 4 call sites and 11 test
assertions import createDefaultWorkflowNode from @valet/shared.
- LabelText only switches to inline-flex layout when a help icon
actually renders; bare labels keep their original block layout.
…nodes
- Trigger: payload shape examples per trigger type; advisory-schema gotcha.
- Foreach: explain "step type" with per-type guidance; rename inspector "Type" select to "Step type" with tooltip.
- Stop: disambiguate output (machine-readable return) vs message (human status line).
- Approval: clarify summary (list title) vs prompt (decision question) vs details (supporting context); ISO 8601 examples.
- Persona free-text input → dropdown sourced from /api/personas (default + per-org list).
- Session start runtime validates repo.url (http/https/ssh/git@) and repo.sourceRepoFullName ("owner/repo") before calling createSession; bad values fail the run with a clear message instead of surfacing later as obscure clone errors.
- Expand session longDescription: start vs prompt mental model, workspace vs title disambiguation; add personaId help.
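The repo validation step could look roughly like the following. The regular expressions and the `validateRepo` helper are assumptions; the commit message only states which URL forms are accepted (http/https/ssh/git@) and that sourceRepoFullName must look like "owner/repo".

```typescript
// Assumed patterns; the real runtime checks may be stricter or looser.
const REPO_URL_PATTERN = /^(https?:\/\/|ssh:\/\/|git@)\S+$/;
const REPO_FULL_NAME_PATTERN = /^[^\/\s]+\/[^\/\s]+$/;

interface RepoConfig {
  url?: string;
  sourceRepoFullName?: string;
}

// Returns human-readable problems; an empty array means the repo config
// is safe to hand to session creation.
function validateRepo(repo: RepoConfig): string[] {
  const errors: string[] = [];
  if (repo.url !== undefined && !REPO_URL_PATTERN.test(repo.url)) {
    errors.push(
      `repo.url must start with http://, https://, ssh://, or git@ (got "${repo.url}")`,
    );
  }
  if (
    repo.sourceRepoFullName !== undefined &&
    !REPO_FULL_NAME_PATTERN.test(repo.sourceRepoFullName)
  ) {
    errors.push(
      `repo.sourceRepoFullName must be "owner/repo" (got "${repo.sourceRepoFullName}")`,
    );
  }
  return errors;
}
```

Failing early with these messages is what lets a bad value stop the run with a clear error rather than surfacing later as a clone failure.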
…dation, 1-click add-next
- LLM inspector: above-the-fold (user prompt, model, output schema); system prompt, temperature, max output tokens collapsed into an "Advanced" disclosure (auto-open when any are set).
- Tool inspector: selecting an action now seeds params with its declared input keys (preserves existing values); retries moved into Advanced. Clarify summary help copy.
- Wait inspector: dropped redundant Mode field; duration input now flags bare scalars and other malformed values inline.
- 1-click downstream-add: inspector renders an "Add next step" tile grid for the high-frequency types; new onAddNextDirect skips the palette and wires the node straight to the selected source handle.
Workflow tool node now renders typed input parameters for every action without each plugin hand-authoring a JSON Schema:
- Zod → JSON Schema converter (lib/zod-json-schema.ts, 10 tests) that walks ZodObject shapes and emits {type, properties, required}. Covers ZodString/Number/Boolean/Enum/Literal/Array/Record/Union/Optional/Default/Nullable/Native enum.
- Catalog endpoint (/api/integrations/actions) falls back to zodToJsonSchema(action.params) when an action lacks an explicit inputSchema.
- MCP tool cache (migration 0021) gains input_schema + output_schema TEXT columns; upsert serializes JSON, list parses with a tolerant fallback so a single bad row can't poison the listing. McpTool/McpActionSource carry outputSchema through.
- session-tools propagates MCP-advertised schemas into cache rows so the catalog can surface them on unauthenticated browse.
Output schemas for native plugins remain a separate per-action authoring task (see TODO).
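The tolerant-parse behavior for the MCP tool cache can be sketched as follows. The row and result shapes and the `listTools` helper are hypothetical; the point is only that a malformed `input_schema` or `output_schema` value degrades to "no schema" for that one tool instead of failing the whole listing.

```typescript
// Hypothetical shapes; column names follow the commit message.
interface McpToolRow {
  name: string;
  input_schema: string | null;
  output_schema: string | null;
}

interface McpTool {
  name: string;
  inputSchema?: unknown;
  outputSchema?: unknown;
}

// Parse a serialized schema, falling back to undefined on null or bad JSON.
function parseSchema(raw: string | null): unknown {
  if (raw === null) return undefined;
  try {
    return JSON.parse(raw);
  } catch (_err) {
    return undefined;
  }
}

function listTools(rows: McpToolRow[]): McpTool[] {
  return rows.map((row) => ({
    name: row.name,
    inputSchema: parseSchema(row.input_schema),
    outputSchema: parseSchema(row.output_schema),
  }));
}
```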
Tool node inspector now renders the response shape for these native plugins, matching the GitHub pattern.
- plugin-slack: 17/17 actions. Shared slimMessage/slimChannel/slimUser/slimUsergroup schemas + slackPostMessageOutputSchema for send-message-style returns. Covers DMs, channel I/O, history/thread reads, user/usergroup ops, reactions, pins, channel info, file fetch.
- plugin-gmail: 12/12 actions. Shared header / message-list / draft-list / triage-message schemas. Covers send, list/get message, label ops, trash, draft CRUD + send, list_labels, triage_inbox.
- plugin-google-calendar: 5/5 actions. Shared event/dateTime/attendee schemas. Covers list/create/update/delete/quick-add.
Google Workspace (Drive + Docs + Sheets, ~63 actions) deferred to a follow-up.
partId was missing from the schema; attachmentId description now explains the bytes require a follow-up call (no plugin action wraps that today).
deploy.sh in this branch made API_PUBLIC_URL mandatory and dropped the WORKER_PROD_URL fallback, but the preview workflow still only passed WORKER_PROD_URL. Now the heredoc writes API_PUBLIC_URL=<DEV_WORKER_URL> and fails fast with an actionable error if the GitHub Actions variable isn't set.
The DEV_ prefix was redundant: each GitHub Actions environment already scopes its own variables, and deploy.sh consumes API_PUBLIC_URL by that name everywhere. One name, one purpose, no in-workflow rename.
- deploy-preview.yml / deploy-dev.yml / deploy-prod.yml: read vars.API_PUBLIC_URL, fail fast with an actionable message if unset, write it through to .env.deploy.<env> unchanged.
- CLAUDE.md: update the required-config table; clarify the variable feeds .env.deploy.<env>, deploy.sh, the Worker runtime, and VITE_API_URL.
To unblock CI: set API_PUBLIC_URL as a GitHub Actions variable on each environment (Settings → Environments → dev/prod → Variables).
Summary
Replace the runner-driven workflow runtime with a Cloudflare Workflow entrypoint (`ValetWorkflowInterpreter`) interpreting `dag/v1` definitions. Author-facing model: draft → validate → publish → version history; runtime: a wave-loop DAG of typed nodes (`llm` / `tool` / `set` / `if` / `wait` / `approval` / `foreach` / `orchestrator` / `session` / `stop`) executing inside `step.do` with cancellation, hibernation, and approval gates handled natively.
What's in this PR
Runtime + lifecycle
- `ValetWorkflowInterpreter` + step.do wave loop with deterministic per-node tracing in `workflow_execution_nodes`.
- Draft + `workflow_definition_versions` lifecycle; routes for sync, draft CRUD, validate, publish, test-run, restore.
- `dag/v1` validator (Zod + semantic) + env-dependent validation at publish + execution-create.
- Per-user + global execution concurrency caps enforced in `createExecution` (test-run goes through the same gate).
Approvals
- `workflow_approvals` + `step.waitForEvent` + `instance.sendEvent`; flat + nested approve/deny endpoints.
- `onConflictDoNothing` insert so step.do replay can't crash on a duplicate-key race.
- … `running`).
Cancellation (fail-closed)
- Cancel API → `cancelling` CAS → `terminate()` → cleanup gated on `cleanup_completed_at`.
- Sweep retries `terminate()` with a `status()` guard for already-terminal instances; cron sweep covers both `cancelling` and `cancelled`-without-cleanup rows.
Tool node
- `action_invocations` enter `pending`; `markExecuted` / `markFailed` CAS-guard on `pending`/`approved` so a cancel-set `failed` row isn't clobbered.
Triggers + webhooks
- Webhook triggers gate on a server-issued one-time token (`X-Valet-Trigger-Token`); CORS allow list updated.
- … `rateLimit` on PATCH.
- Reject legacy `repoUrl`/`branch`/`ref`/`sourceRepoFullName` on `/manual/run` + `/:id/run`.
Client
- … `waiting_approval`).
- Workflow Published vs Draft badge driven by `published_version_id`.
Cleanup
- Drop legacy `workflow_executions` columns (`variables`, `steps`, `workflow_hash`, `workflow_snapshot`, `attempt_count`, `session_id`, `runtime_state`, `resume_token`) and the `workflow_execution_steps` table.
- Drop dead `lib/db/executions` helpers and `reconcileWorkflowExecutions`.
- Remove `resumeToken` / `variables` API mirror.
Docs
- `docs/specs/workflows.md` authoritative; `architecture.md`, `real-time.md`, `sandbox-images.md`, `runner-and-opencode.mdx`, `workflow-execution.mdx` updated to `dag/v1` present tense.
- Product docs (`built-in-tools.mdx`, plugin-workflows skill, `opencode.json` instruction) point at the web UI for workflow management. Sessions don't currently have a wired API path; tool-based management is a follow-up.
Tests
981 passing across 78 files. Notable additions:
- … `failed` preserved).
Test plan
- … `scripts/smoke-dag.sh`)