Skip to content

Add direct-API and codex review runners with live logging#19

Merged
sergeyfast merged 18 commits into
masterfrom
feat/direct-api-runner
Jun 13, 2026
Merged

Add direct-API and codex review runners with live logging#19
sergeyfast merged 18 commits into
masterfrom
feat/direct-api-runner

Conversation

@sergeyfast

Copy link
Copy Markdown
Contributor

Summary

  • Add direct-API runner (--runner direct): drives an LLM through the Messages/Chat API directly (no CLI) with a narrow review tool set (read/grep/glob/git_diff/ast_*) plus a submit_review tool. New pkg/reviewer/direct package — agent loop with parallel tool dispatch (loop.go), diff + changed-file preload (preload.go), per-round usage/cost accounting, mid-run compaction, and a submit_review that always produces a filled review.json (no Step-2 retry needed).
  • Add two providers behind a LLMProvider interface, selected by --api-provider: native Anthropic (provider_anthropic.go, streaming via Messages.NewStreaming + Accumulate, rolling prompt-cache breakpoint, adaptive thinking preserved through Raw) and OpenAI-compatible (provider_openai.go, covers deepseek and openai-compat). Pricing/effort resolved in provider_factory.go, incl. DeepSeek V4 (deepseek-v4-pro/deepseek-v4-flash).
  • Add codex runner (--runner codex): wraps codex exec --json (runner/codex.go), parses the JSONL event stream, and estimates cost from published token rates since codex reports none; --model defaults to gpt-5.1-codex.
  • Extract all runners into pkg/reviewer/runner (claude/opencode moved, codex/direct new), all behind the ReviewRunner interface with compile-time assertions. Controller now depends on runner.ReviewRunner.
  • Live event logging for every runner: claude switched to --output-format stream-json --verbose, and claude/opencode/codex/direct now surface tool calls live to the reviewctl log while still writing the full transcript to *-output.json(l).
  • Cross-runner --effort knob (low..max): Anthropic + codex (-c model_reasoning_effort) honor it; OpenAI-compatible providers ignore it. Direct-runner API key resolves per provider (REVIEW_API_KEYANTHROPIC_API_KEY/DEEPSEEK_API_KEY/OPENAI_API_KEY), see directKeyEnvs.
  • Reliability fixes surfaced by self-review: symlink sandbox escape and compaction role collision (fe1e643), bare-assistant-turn 400 from V4 (0c82d0c), recover around panicking tools in parallel dispatch (451f049), large-diff preload overflow with git_diff(path=...) hint (b0ae204), streaming to avoid HTTP read timeouts (ab06be4).
  • Frontend: render mermaid per-diagram and fall back to a small notice + raw source on invalid (often LLM-generated) syntax, instead of mermaid's full-page error graphic (MarkdownContent.vue).
  • Housekeeping: README documents the four runners; anthropic-sdk-go + go-openai promoted to direct deps; local-only DirectAPIRunner planning docs git-ignored.

Test plan

  • make fmt lint test — green
  • reviewctl review --runner direct --api-provider anthropic --model claude-opus-4-8 --effort xhigh — submits a filled review.json, live direct tool events in the log, transcript in direct-output.jsonl
  • reviewctl review --runner direct --api-provider deepseek --model deepseek-v4-pro — completes without the bare-assistant 400; cost/tokens recorded
  • reviewctl review --runner codex --model gpt-5.1-codex — parses JSONL, estimates cost, posts issues
  • reviewctl review --runner claude — stream-json parsed (last result line), live claude tool events, normal upload/comment flow intact
  • reviewctl review --runner opencode -m openrouter/deepseek/deepseek-v4-flash — live opencode tool events, review submitted
  • UI: a review with invalid mermaid shows the inline ⚠ fallback, not the full-page error

- Add pkg/reviewer/direct: in-process agent loop that calls the LLM
  API directly (DeepSeek/OpenAI-compatible) with a narrow review tool
  set — read_file, read_files, glob, grep, git_diff and ast_* — instead
  of shelling out to the claude/opencode CLIs
- Stream the review output via set_group/add_issues/submit_review so
  each call stays small (a monolithic payload overflows DeepSeek's
  output cap and arrives as truncated JSON); tolerate stringified args
- Wire DirectRunner (--runner direct) that fills the existing
  ClaudeResult/ReviewModelInfo; measure tokens and cost per round
- Pre-load the diff and changed-file contents into the kickoff and
  dedup repeated reads to cut round-trips
- Rebuild the ast-index automatically before review; ast_* tools
  degrade to grep when the binary is absent
- Stream the session transcript to direct-output.jsonl and add it to
  the debug bundle
- Fix MarkdownContent.vue to render mermaid per block with a fallback
  on syntax errors instead of the full-page error graphic
- Add github.com/sashabaranov/go-openai dependency and vendor it
- Add anthropic-sdk-go provider (M1) with prompt caching, adaptive
  thinking and effort, replacing the not-implemented stub
- Cache the whole message history via a rolling cache breakpoint so
  the preload and tool results are not re-billed every round (fresh
  input dropped from 222k to ~40 tokens on an Opus run)
- Preserve signed thinking blocks across rounds by replaying the
  verbatim assistant turn (resp.ToParam) instead of rebuilding it
  from text and tool calls, avoiding lost reasoning continuity
- Fix compaction emitting two consecutive user messages, which the
  Anthropic API rejects, by folding the marker into the head message
- Default direct+anthropic to claude-opus-4-8 and effort xhigh, and
  thread --effort through to the claude CLI runner
- Exclude vendor and generated trees from git_diff and the changed
  file preload so they do not swamp the review
- Wipe a previous run's outputs (R*.md, review.json, session logs)
  before each review so the runner starts on a clean tree
- Add tests for the above and vendor anthropic-sdk-go with its deps
- Tell the model to emit independent set_group/add_issues calls in
  the same step instead of one per step, splitting only to avoid
  output truncation (keeps the DeepSeek 8k-cap safety valve)
- Collapses the output phase from ~7 rounds to ~3, so the growing
  conversation is re-billed (cache read) fewer times per run
- Price deepseek-v4-pro and deepseek-v4-flash; legacy deepseek-chat/
  reasoner now price as V4 Flash (they alias it, deprecating 2026-07-24)
- Fix openai provider sending a bare assistant message (no content,
  no tool calls) which DeepSeek rejects with HTTP 400; skip it, as the
  anthropic provider already does. deepseek-v4-pro emits such turns
Both issues surfaced by dogfooding the review runner (Opus and
DeepSeek-V4-pro flagged the symlink escape independently).

- resolveInRoot now resolves symlinks and re-checks the target stays in
  root; a symlink inside the tree pointing outside (e.g. evil.txt ->
  /etc/passwd) previously passed the lexical check and leaked host file
  contents into the review output
- compactMessages now resumes the tail on an assistant turn, so a
  mid-history user message (e.g. a submit nudge) landing on the cut
  boundary no longer collides with the kept user head into two
  consecutive user messages, which the Anthropic API rejects with 400
- Add regression tests for both
- Emit "system" and "user" events at run start so direct-output.jsonl
  is a full input/output log; it previously recorded only model output
  and tool I/O, omitting the system contract and the preloaded task
- Switch the anthropic provider from Messages.New to NewStreaming +
  Accumulate; a heavy high/xhigh round (adaptive thinking + long
  output) could exceed the non-streaming HTTP read timeout
- Usage, stop reason, content and ToParam replay are unchanged — the
  accumulated message is equivalent to the non-streamed one
- List the names of changed files that overflow the preload budget
  instead of just a count, so the model can read_files them directly
  without a discovery round
- Add a path argument to git_diff so the model can pull the full diff
  of a single file, bypassing the 300k whole-diff clip — restoring the
  diff view for files the preload could not inline
- Add validPath/pathspec tests
- Add ExecCodexRunner: shells out to `codex exec --json` in the
  workspace-write sandbox, aggregates the JSONL stream (thread/turn/
  item events) into ClaudeResult and estimates cost from tokens since
  codex reports none; modeled on a codex-exec agent
- Wire --runner codex and the RunnerCodex identifier
- Stream significant events to the runner log live: codex tool commands
  and per-turn usage, plus the direct runner's tool calls and per-round
  usage, via an optional per-line stdout tee in runExec
- Add codex parser/argv/cost tests
- Move ExecClaudeRunner/ExecOpenCodeRunner/ExecCodexRunner/DirectRunner,
  the shared ClaudeResult types and the ReviewRunner interface out of the
  overloaded ctl package into pkg/reviewer/runner, mirroring a flat
  agent layout
- ctl (Controller/Config/recovery) and main now import the runner package
- Pure move, no logic change
Fix valid review-54 finding A1: an arbitrary OpenAI-compatible endpoint
had to put its key in DEEPSEEK_API_KEY because directKeyEnv only split
anthropic vs everything else.

- Add a provider-agnostic REVIEW_API_KEY override; openai-compat now
  falls back to OPENAI_API_KEY and deepseek to DEEPSEEK_API_KEY
- Match the provider name case-insensitively (aligns with NewProvider,
  also closing the Anthropic-vs-anthropic key mismatch)
- Add cmd/reviewctl tests
- Stream significant opencode events to the runner log as they arrive
  (tool calls with a short detail, per-step usage at debug), matching
  the codex and direct runners, via the runExec per-line stdout tee
The research/spec docs are kept on disk for reference but removed from
history and ignored so they are not committed.
- dispatchParallel now recovers a panicking tool handler into an error
  result instead of crashing the review process (flagged by the
  opencode review); the sibling tools in the turn still complete
- Pin gpt-5.1-codex as the codex default model so its token-based cost
  is estimated instead of reported as 0; override with --model
- go mod tidy: anthropic-sdk-go and go-openai are imported directly, no
  longer flagged // indirect
- The direct loop now sums provider Complete time into DurationAPIMs
  instead of reporting total wall-clock (which includes local tool time)
- Switch the claude runner to --output-format stream-json --verbose so
  its NDJSON events can be surfaced live: tool calls are logged as they
  arrive, matching the codex/opencode/direct runners
- ParseClaudeResult now decodes a stream of newline-delimited objects
  (keeping the last "result"), in addition to the single-object and
  JSON-array forms; the result line carries the same fields
- Validated against a real `claude --output-format stream-json` sample
- List all four runners (claude/opencode/codex/direct) and the direct
  runner's --api-provider/--effort flags and REVIEW_API_KEY env var
- Reorder test imports to localmodule-before-default per the gci config
  (config_test.go, ctl_test.go)
- Pre-allocate the preloaded slice in the direct preload (prealloc)
- Reuse directSubtypeSuccess/directSubtypeError for runner result
  subtypes instead of "success"/"error" literals (goconst)
- Add codexExecCmd, gitNoPager and gitDiffCmd consts to drop duplicated
  command-argument string literals (goconst)
@sergeyfast sergeyfast merged commit 9f91653 into master Jun 13, 2026
3 checks passed
@sergeyfast sergeyfast deleted the feat/direct-api-runner branch June 13, 2026 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant