release: dev → main #1179

Merged
cgcardona merged 130 commits into main from dev
Mar 16, 2026
@cgcardona

Owner

Merges all recent fixes and features from dev into main:

  • Reviewer dispatch routing fix (org chart reviewer now correctly anchors to PR branch)
  • Rename pr-reviewer → reviewer everywhere
  • Persistent fastembed/ONNX volume (no more repeated 600MB downloads)
  • Local LLM token budget and reasoning_effort fixes
  • Role loader with family fallback
  • Recon prompt hardening and raw response logging
  • Docker/env config corrections
  • UI polish (inspector, activity feed, org designer)

cgcardona and others added 30 commits March 15, 2026 23:03
chore: sync dev with main merge commit (fast-forward)
.build-issue__branch already had white-space: nowrap + overflow: hidden +
text-overflow: ellipsis + max-width: 100%, but the truncation never fired.
Root cause: .build-issue__pr-row is a child of the flex-column
.build-issue__footer, whose align-items: flex-start shrinks the row to its
content width. The branch name's max-width: 100% then resolves against that
already-oversized parent row and does nothing.

Fix: add width: 100% to .build-issue__pr-row so the percentage resolves
against the card width, not the branch name's intrinsic length.
fix: branch name truncates to card width (not hardcoded length)
… local model name

Three cosmetic/correctness fixes to the llm_iter row in the activity feed:

1. Remove color-coded left borders from all non-error activity rows.
   The border-left-color rules by subtype added visual noise without meaning.
   Only error rows (border-left: #ef4444) and shell exit-nonzero rows keep a
   coloured border because those carry real signal.

2. Drop the 'Iteration N' second line from llm_iter rows.
   The step count already establishes position; the iteration number inside a
   step is redundant. buildIterSummary now renders a single .af__iter-model
   span. Matching SCSS removes the flex-direction: column layout and the
   now-unused .af__iter-num selector.

3. Fix 'Local: local' — show the actual Ollama model name.
   Two root causes:
   a) Backend (llm.py): local_llm_tool_call emitted llm_iter with 'model: model'
      (the default parameter 'local') instead of 'model: agent_model or model'
      (the resolved settings value, e.g. 'qwen2.5:7b').
   b) Frontend (format_utils.ts): parseModelInfo had no handling for Qwen,
      Llama, Mistral, Phi, DeepSeek, or Gemma model strings from Ollama.
   Now: 'qwen2.5:7b' → { network: 'Local', modelShort: 'Qwen 2.5' } →
   display 'Local: Qwen 2.5'. 'local' (fallback) → modelShort '' → display
   just 'Local' (no redundant ': local' suffix).
fix: activity feed — remove color borders, drop iteration count, fix local model name
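The backend half of fix 3 comes down to which name is attached to the event. A minimal sketch, assuming a hypothetical `resolve_event_model` helper (the commit only states that `agent_model or model` replaced `model`):

```python
from typing import Optional


def resolve_event_model(model: str = "local", agent_model: Optional[str] = None) -> str:
    """Return the model name to report on an llm_iter event.

    `model` is the call-site default ('local'); `agent_model` stands in for
    the resolved settings value. The helper itself is illustrative only.
    """
    # Prefer the configured model; fall back to the default only when unset.
    return agent_model or model
```

With `agent_model="qwen2.5:7b"` the event carries the real Ollama name, and the frontend only sees the bare `local` string when nothing is configured.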
…in Docker

Tool rows (tool_invoked, github_tool) now expand on click to show the full
parsed arguments inline below the row:
- 4-column grid adds a chevron (af__chevron) that rotates 90° when open
- Click (or Enter/Space) toggles aria-expanded and the hidden detail panel
- af__tool-detail shows each arg as key · value using af__detail-key /
  af__detail-val; falls back to raw preview if parsing fails
- Keyboard-accessible: role=button, tabindex=0
- Light and dark mode styled
- 5 new unit tests covering expand, collapse, key-value rendering, and
  non-expandable row types

Also fixes a long-standing Docker gap: vitest.config.ts was never copied into
the image (only package.json, tsconfig.json etc. were). Without the config
Vitest used default include patterns (**/*.spec.ts) which picked up the
Playwright E2E spec and poisoned the jsdom environment for all other tests.
Fix: add vitest.config.ts to the Dockerfile COPY step. The config also gains
an explicit exclude for agentception/tests/** to prevent future E2E spec
pickup regardless of Vitest version changes.
feat: expandable tool call rows in activity feed
- llm_iter rows are suppressed entirely; instead a single .af__model-header
  element is inserted once at the top of #activity-feed showing the run model
  (e.g. "Local: Qwen 3.5") so it is not repeated on every step
- llm_usage rows are suppressed; their token count is injected directly into
  the .event-card__tokens placeholder on the current step header, appearing
  right-aligned on the same line as "STEP N"
- Removed per-subtype indentation from tool/file/shell/git_push rows — they
  are now flush with the rest of the feed since llm_iter is no longer a
  visual parent
- step_context: track _currentStepHeader; export getCurrentStepHeader()
- event_card: add empty .event-card__tokens span to step_start cards
- SCSS: .af__model-header styles, .event-card__tokens styles, pruned dead
  llm_iter/llm_usage overrides, updated .af__tool-detail padding
- 250 tests passing, zero type errors
feat: model info in feed header; token count on step header; remove tool indent
fix: colon separator in tool summary rows
fix: sans-serif colon separator; 0:00 timestamp instead of 'now'
…e entry

- Add humanizeDetailKey() to format_utils with a 40-key label map
  (n_results→results, cmd_preview→command, old_string→find, etc.)
- buildToolDetail uses humanizeDetailKey for all rendered keys
- start_line + end_line are collapsed into a single "lines: N–M" entry
  instead of two separate rows
- 260 tests passing, zero type errors
feat: humanize detail panel key names; collapse line range
Backend:
- FileReadPayload gains content_preview: NotRequired[str]
- read_file_lines emits first 10 lines / 400 chars as content_preview

Frontend:
- file_read rows are now expandable (click chevron to reveal preview)
- buildFileReadDetail renders content_preview in a scrollable <pre> block
- Falls back to a "no preview" note for historical events without the field
- .af__content-preview SCSS: compact code block, max-height 10rem, scrollable
- 263 tests passing, zero mypy errors, zero tsc errors
feat: file content preview in file_read activity events
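A rough sketch of how the backend preview could be built. It assumes "first 10 lines / 400 chars" means both caps apply (lines first, then characters); the `path` field and the `build_content_preview` name are placeholders, since the commit only names `content_preview`.

```python
try:
    from typing import NotRequired, TypedDict  # Python 3.11+
except ImportError:  # older interpreters
    from typing_extensions import NotRequired, TypedDict

MAX_PREVIEW_LINES = 10
MAX_PREVIEW_CHARS = 400


class FileReadPayload(TypedDict):
    path: str  # assumed field, not taken from the commit
    content_preview: NotRequired[str]  # absent on historical events


def build_content_preview(text: str) -> str:
    """Keep at most the first 10 lines, then cap the result at 400 characters."""
    head = "\n".join(text.splitlines()[:MAX_PREVIEW_LINES])
    return head[:MAX_PREVIEW_CHARS]
```

Because the field is `NotRequired`, older events without a preview still type-check, which matches the frontend's "no preview" fallback.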
Backend:
- DirListedPayload TypedDict: path, entry_count, entries (newline-delimited)
- dir_listed added to ACTIVITY_SUBTYPES
- agent_loop: emit dir_listed after successful list_directory with
  proper isinstance narrowing for JsonValue typing

Frontend:
- folder SVG icon added to icons.ts
- dir_listed: expandable row (folder icon, "N entries" summary)
- buildDirListedDetail renders entries as a pre block via af__content-preview
  (same SCSS reused from file_read content preview)
- 267 tests passing, zero mypy errors, zero tsc errors
feat: dir_listed activity event — show list_directory results in inspector
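The "isinstance narrowing" mentioned for the backend might look roughly like this. The payload fields come from the commit; the shape of the tool result (`{"entries": [...]}`) and the `build_dir_listed_payload` name are assumptions made for illustration.

```python
from typing import Dict, List, Optional, TypedDict, Union

# Recursive JSON type, written out here so the sketch is self-contained.
JsonValue = Union[None, bool, int, float, str, List["JsonValue"], Dict[str, "JsonValue"]]


class DirListedPayload(TypedDict):
    path: str
    entry_count: int
    entries: str  # newline-delimited entry names


def build_dir_listed_payload(path: str, result: JsonValue) -> Optional[DirListedPayload]:
    """Narrow an untyped tool result step by step before building the payload."""
    if not isinstance(result, dict):
        return None
    raw_entries = result.get("entries")  # assumed key in the tool result
    if not isinstance(raw_entries, list):
        return None
    names = [entry for entry in raw_entries if isinstance(entry, str)]
    return {"path": path, "entry_count": len(names), "entries": "\n".join(names)}
```

Each `isinstance` check lets a type checker such as mypy treat the value as a concrete type on the following lines, so no casts are needed.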
- Increase arg_preview 120→500 chars and text_preview 200→1500 chars so
  full LLM reply text and complete tool args fit without truncation.
- Make llm_reply rows expandable: clicking reveals the full text_preview
  in a scrollable prose block (af__content-preview--reply).
- Suppress the internal `collection` key (Qdrant collection name) from
  all tool detail panels via HIDDEN_DETAIL_KEYS.
- Render old_string/new_string pairs as a colour-coded find/replace diff
  block (af__diff-block--old / --new) instead of raw key-value lines.
- SCSS: add diff block styles with red/green tints for find/replace, and
  a wrapping prose variant for LLM reply previews.
- Tests: 8 new assertions covering llm_reply expand, diff rendering, and
  collection suppression — 271 tests passing.
feat(feed): llm reply expandable, search humanisation, code diff display
Adds scripts/retrofit_activity_events.py — idempotent back-fill for two
event types added after existing runs were recorded:

- dir_listed: reconstructs 29 events across 3 runs from tool result rows
  in agent_messages (entries list stored in raw JSON).
- llm_reply text_preview: extends 2 events whose previews were truncated
  at 200 chars to the new 1500-char limit using the matching assistant
  message in agent_messages.

Safe to re-run; no-ops when there is nothing left to do.
Usage: docker compose exec agentception python3 /app/scripts/retrofit_activity_events.py [--dry-run]
feat(scripts): retrofit historical activity events
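For readers unfamiliar with the pattern, this is one way an idempotent back-fill with a dry-run switch can be structured. The `events` table, its columns, and the `source_id` link are invented for this sketch; the real script reads `agent_messages` and is not reproduced here.

```python
import sqlite3

# Invocations that do not yet have a matching dir_listed event.
MISSING_SQL = """
SELECT t.id, t.run_id
FROM events AS t
WHERE t.subtype = 'tool_invoked'
  AND NOT EXISTS (
      SELECT 1 FROM events AS d
      WHERE d.subtype = 'dir_listed' AND d.source_id = t.id
  )
"""


def backfill_dir_listed(conn: sqlite3.Connection, dry_run: bool = False) -> int:
    """Insert missing dir_listed events; return how many were (or would be) added."""
    missing = conn.execute(MISSING_SQL).fetchall()
    if missing and not dry_run:
        conn.executemany(
            "INSERT INTO events (run_id, subtype, source_id) VALUES (?, 'dir_listed', ?)",
            [(run_id, event_id) for event_id, run_id in missing],
        )
        conn.commit()
    return len(missing)
```

Because the query only selects rows that still lack a result event, a second run finds nothing and becomes a no-op.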
dir_listed is a result, not a peer event. Instead of rendering a separate
folder-icon row, inject directory entries directly into the detail panel
of the preceding list_directory tool_invoked row:

- Remove dir_listed from EXPANDABLE_SUBTYPES.
- Add injectDirListedIntoPanel(): finds the nearest [data-list-dir-target]
  panel in the current step container and appends an 'entries' key row +
  pre block.
- Tag list_directory tool_invoked detail panels with data-list-dir-target
  so the injector can find them.
- Route dir_listed events to the injector then return (no standalone row).
- Tests: replaced 4 dir_listed-as-row tests with 3 nesting-behaviour tests.
  270 tests passing.
feat(feed): nest dir_listed entries inside list_directory row
Same pattern as dir_listed / list_directory:

Backend:
- Add SearchResultsPayload TypedDict and 'search_results' to ACTIVITY_SUBTYPES.
- After search_codebase: emit search_results with deduplicated relative file
  paths from match objects.
- After search_text: emit search_results by parsing rg --heading output to
  extract file path lines.

Frontend:
- Tag search_codebase / search_text tool_invoked detail panels with
  data-search-target.
- Route search_results events to injectSearchResultsIntoPanel(), which finds
  the nearest data-search-target panel and appends a 'files' label + pre block.
- No standalone row created — results live inside the search row chevron.
- Tests: 4 new assertions; 274 tests passing.
feat(feed): show search result files nested inside search rows
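Extracting file paths from `rg --heading` output can be sketched as below. It assumes ripgrep was run with line numbers enabled, so every match or context line starts with `<number>:` or `<number>-`; the function name is hypothetical.

```python
import re
from typing import List

# Match lines look like "12:text"; context lines look like "12-text".
_MATCH_OR_CONTEXT = re.compile(r"^\d+[:-]")


def files_from_rg_heading(output: str) -> List[str]:
    """Return deduplicated file paths in first-seen order."""
    files: List[str] = []
    seen = set()
    for line in output.splitlines():
        if not line.strip() or line == "--":
            continue  # blank separators and context-group dividers
        if _MATCH_OR_CONTEXT.match(line):
            continue  # a match or context line, not a heading
        if line not in seen:
            seen.add(line)
            files.append(line)
    return files
```

A file whose name itself starts with digits followed by `:` or `-` would be misread as a match line; the sketch accepts that limitation.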
Adds _backfill_search_results() to retrofit_activity_events.py:
- Finds runs with search_codebase/search_text tool_invoked events but no
  search_results activity events.
- Reconstructs file lists from agent_messages tool results: structured
  matches array for search_codebase, rg --heading output parsing for
  search_text.
- Matches invocations to results sequentially (same strategy as dir_listed).

Ran live: 61 search_results events inserted across 3 historical runs.
Script remains fully idempotent.
feat(scripts): retrofit search_results for historical runs
Three fixes:

1. Injection scope: search the full feed element (not just the current
   step container) when injecting dir_listed and search_results. This
   makes injection resilient to step-boundary timing edge cases where
   the step context changes between the tool_invoked and result events.

2. Label: n_results → 'limit' in ARG_KEY_LABELS. The param is the
   request cap, not the count of matches found. Showing 'results: 10'
   implied 10 results were found rather than requested.

3. Retrofit timestamps: search_results events now anchor to
   invocation.recorded_at + 1 s (not the tool result timestamp) to
   guarantee they sort strictly after their tool_invoked event. The
   script purges and re-inserts to fix the previously stored events.

274 tests passing.
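The timestamp rule in fix 3 is simple enough to state in code. A minimal sketch, with a hypothetical function name; only the "+ 1 s" offset comes from the commit.

```python
from datetime import datetime, timedelta


def retrofit_timestamp(invocation_recorded_at: datetime) -> datetime:
    """Place the back-filled result event one second after its invocation."""
    return invocation_recorded_at + timedelta(seconds=1)
```

Anchoring to the invocation rather than the stored tool result keeps the ordering independent of whatever timestamp the result row happens to carry.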
AgentCeption Bot and others added 28 commits March 16, 2026 13:21
Named volumes shadow image layer content.  The agentception-fastembed-cache
volume was created when models were being downloaded to the wrong path
(before the HOME fix).  Docker never re-initialises a non-empty volume
from image content, so the stale empty volume permanently blocked the
correctly-baked models from being visible at runtime.

Fix: remove the named volume entirely.  The fastembed ONNX models are
baked into the image layer at build time (Dockerfile RUN step before
COPY agentception/).  Using the image layer directly is simpler and
more reliable — no stale-volume problem class possible.

Also remove the fastembed chown block from entrypoint.sh (no longer
needed; the image layer has correct ownership from the build step).

Verified: both jinaai/jina-embeddings-v2-base-code and
BAAI/bge-reranker-base load from the image layer in ~8s and ~20s
respectively with no download and no ONNX errors.
fix: remove fastembed_cache named volume — use image layer directly
The recon warning was opaque — 'could not parse plan from LLM response'
with no indication of what the model actually returned. Now logs the
first 500 chars of the raw response so the root cause (empty string,
plain text, malformed JSON, etc.) is immediately visible.
…ey for Ollama 0.18

Two fixes:

1. agent_loop.py: when recon JSON parse fails, log the first 500 chars
   of the raw LLM response so the root cause is immediately visible in
   logs rather than requiring a separate debug session.

2. llm.py: _normalize_openai_message_content checked message['reasoning']
   (Ollama ≤0.17) but Ollama 0.18+ uses message['reasoning_content'].
   When the key was wrong, the token-budget-exhausted warning was silently
   swallowed, completion() returned '', _parse_recon_json('') returned None,
   and the only visible symptom was the opaque 'could not parse plan'
   warning. Now checks both keys.
fix: log raw recon response on parse failure; fix reasoning_content key for Ollama 0.18
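Checking both reasoning keys might look like the following. The key names are the ones given in the commit; `extract_reasoning` is an illustrative helper, not the body of `_normalize_openai_message_content`.

```python
from typing import Any, Mapping, Optional


def extract_reasoning(message: Mapping[str, Any]) -> Optional[str]:
    """Return reasoning text from whichever key the server version uses."""
    # Per the commit: 'reasoning_content' on Ollama 0.18+, 'reasoning' before that.
    for key in ("reasoning_content", "reasoning"):
        value = message.get(key)
        if isinstance(value, str) and value:
            return value
    return None
```

A caller can then treat "empty content but non-empty reasoning" as the signal that the token budget ran out during chain-of-thought, regardless of the Ollama version.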
)

Reviewer agents intentionally return {"files":[],"searches":[],...} when
they don't need to pre-load source files before starting their loop.
The previous guard `if not files and not searches: return None` treated
this as a parse failure, emitting a spurious warning and aborting the
recon phase entirely.

Remove the guard.  An empty-lists plan is a valid signal: the recon
phase becomes a no-op and the agent proceeds immediately with existing
context (e.g. the PR diff already in the prompt).

Regression test added: test_parse_recon_json_empty_lists_valid confirms
both empty-with-plan and empty-without-plan are accepted.

Co-authored-by: AgentCeption Bot <agent@agentception.io>
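To make the behaviour change concrete, here is a simplified stand-in for the parser after the guard is removed. `parse_recon_plan` is a hypothetical name and the validation shown is an assumption; only the accepted empty-lists case comes from the commit.

```python
import json
from typing import Any, Dict, Optional


def parse_recon_plan(raw: str) -> Optional[Dict[str, Any]]:
    """Parse a recon plan; an empty-lists plan is valid, not a failure."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(plan, dict):
        return None
    files = plan.get("files", [])
    searches = plan.get("searches", [])
    if not isinstance(files, list) or not isinstance(searches, list):
        return None
    # No "if not files and not searches: return None" check here:
    # empty lists mean the recon phase is a deliberate no-op.
    return plan
```

Only genuinely unparseable input returns `None`, so the warning path is reserved for real failures.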
#1164)

Silent JSONDecodeError swallowing made it impossible to diagnose why
_parse_recon_json was returning None.  Now logs:
  - the decode error position and message
  - up to 300 chars of the extracted JSON substring that failed
  - the type if the parsed value is unexpectedly not a dict

Also expands the outer warning's raw dump from 500 → 2000 chars so the
full LLM response is visible in logs when the parse fails, rather than
being truncated mid-plan-string.

Co-authored-by: AgentCeption Bot <agent@agentception.io>
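The three logged details map directly onto what `json.JSONDecodeError` exposes. A sketch under the assumption that the JSON substring has already been extracted; the logger name and function name are placeholders.

```python
import json
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger("recon_sketch")


def parse_json_object(candidate: str) -> Optional[Dict[str, Any]]:
    """Parse a JSON object, logging why parsing failed instead of swallowing it."""
    try:
        value = json.loads(candidate)
    except json.JSONDecodeError as exc:
        # exc.pos and exc.msg give the failure position and reason.
        logger.warning(
            "JSON decode failed at pos %d: %s; candidate=%r",
            exc.pos, exc.msg, candidate[:300],
        )
        return None
    if not isinstance(value, dict):
        logger.warning("parsed JSON is %s, expected dict", type(value).__name__)
        return None
    return value
```

Truncating the logged substring to 300 characters keeps log lines bounded while still showing where the model's output went wrong.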
#1165)

The model was treating the recon turn as a regular agent tool loop,
emitting read_file(...) bash-fence syntax instead of the required JSON
plan.  The original "Output ONLY the JSON object, nothing else" rule was
not strong enough to override the tool-use patterns in the main system
prompt.

New addendum explicitly states:
  - "THIS IS NOT A TOOL-USE TURN" in all-caps
  - Lists all forbidden output forms (tool calls, prose, non-json fences)
  - Lists the single required form (the json fence with the plan schema)
  - Explains why tool calls cannot work (no dispatcher in this turn)

Co-authored-by: AgentCeption Bot <agent@agentception.io>
The FORBIDDEN list and all-caps warnings were noise — the format example
is the constraint.  Replace 28 lines with 8:
  - one "not in a tool loop" line (replaces the FORBIDDEN list)
  - the json example (unchanged — this is what models actually follow)
  - the two constraint lines (≤8 files, ≤5 searches, omit pre-loaded)

Co-authored-by: AgentCeption Bot <agent@agentception.io>
#1167)

With think=True each tool-call iteration was burning 30k+ tokens on
internal reasoning before writing a single tool call, exhausting the
entire 32k ceiling and producing:
  - empty content (stop_reason=tool_calls, 0 actual calls)
  - "token budget exhausted during chain-of-thought" warning

The agent loop is already an iterative reasoning mechanism — it builds
understanding turn-by-turn through tool calls.  Per-iteration CoT adds
latency and budget pressure without improving tool-selection quality.

think=False lets the model act immediately on each turn.  If a specific
high-stakes call needs CoT in future, it can opt in at the call site.

Co-authored-by: AgentCeption Bot <agent@agentception.io>
…1168)

Root cause: fastembed defaults its cache to tempfile.gettempdir() =
/tmp/fastembed_cache, which Docker wipes on every container restart.
The Dockerfile pre-download used HOME=/home/agentception (irrelevant —
fastembed never reads HOME) so models landed in /tmp during the build
and were either lost or never reachable by the app which explicitly
passes cache_dir=/home/agentception/.cache/fastembed.

Fix:
- Named Docker volume agentception-fastembed-cache mounted at
  /home/agentception/.cache/fastembed — the exact path code_indexer
  already passes as cache_dir.  Volume survives every restart and rebuild.
- FASTEMBED_CACHE_PATH env var set to match, so the entrypoint's
  startup check and any future manual downloads use the same path.
- entrypoint.sh verifies both ONNX files at startup; re-downloads only
  if missing (self-healing for fresh installs or volume corruption).
- Dockerfile pre-download step removed: it was always writing to /tmp,
  never to the path the app reads from, and was the source of repeated
  600 MB re-downloads.  Builds are now fast and deterministic.
- Fix pre-existing test breakage: _role_base_fallback and _load_role_prompt
  tests updated to import from role_loader (where they moved in PR #1148).

Co-authored-by: AgentCeption Bot <agent@agentception.io>
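The path mismatch is easier to see as a lookup order. This sketch mirrors the behaviour described in the commit (explicit `cache_dir`, then `FASTEMBED_CACHE_PATH`, then a temp-dir default); it is not copied from the fastembed source, and `resolve_fastembed_cache` is a made-up name.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def resolve_fastembed_cache(cache_dir: Optional[str] = None) -> Path:
    """Return the directory models would be read from or written to."""
    if cache_dir:
        return Path(cache_dir)  # what the app passes explicitly
    env_path = os.environ.get("FASTEMBED_CACHE_PATH")
    if env_path:
        return Path(env_path)  # what the entrypoint check relies on
    # Default described in the commit; HOME plays no part in this lookup.
    return Path(tempfile.gettempdir()) / "fastembed_cache"
```

A build step that sets only `HOME` falls through to the last branch, which is why the pre-downloaded models never reached the path the app reads.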
A 404 response from Qdrant means the collection for this worktree does
not exist yet — it gets created when the agent indexes the codebase, but
the recon phase searches before any indexing has happened.  This is
expected for every new agent run and should not appear as a WARNING.

Demote to INFO with a clear message so operators know it is normal.
Real failures (non-404 exceptions) still log at WARNING.

Co-authored-by: AgentCeption Bot <agent@agentception.io>
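The demotion rule can be expressed as a small level-selection function. This is a sketch only: how the status code is obtained from the Qdrant client is left out, and the function name is hypothetical.

```python
import logging
from typing import Optional


def search_failure_log_level(status_code: Optional[int]) -> int:
    """Pick the log level for a failed Qdrant search."""
    # 404 means the collection has not been created yet, which is expected
    # on a new run; anything else is treated as a real failure.
    return logging.INFO if status_code == 404 else logging.WARNING
```

Passing `None` for failures without an HTTP status keeps them at WARNING.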
think=true/false is silently ignored by Ollama's OpenAI-compatible
/v1/chat/completions endpoint — all three of think=None, think=True,
think=False produce identical output (145 completion tokens, 3.7s).
reasoning_effort="none" is the correct parameter and actually works:
2 tokens, 0.2s for a simple response; 31 tokens, 1.8s for a tool call.

At agent scale (10 rounds, 29k context) default thinking burned 4850
completion tokens and took 175 s per turn; with reasoning_effort="none"
the same turn costs 173 tokens in 10.6s — 28x fewer tokens, 16x faster.

Changes:
- call_local_llm_with_tools: "think": False → "reasoning_effort": "none"
- _local_completion_payload: think: bool param → reasoning_effort: str param
  (default "none" for recon/summaries; streaming planner uses "medium")
- _normalize_openai_message_content: updated warning message to reflect
  that exhaustion is now unexpected and points at the correct fix
fix: replace dead think=false with reasoning_effort=none for Ollama
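A sketch of the request body after this change, plus the arithmetic behind the quoted ratios. The function is a simplified stand-in for `_local_completion_payload`; every field other than `reasoning_effort` is an assumption about a typical OpenAI-style chat request.

```python
from typing import Any, Dict, List


def local_completion_payload(
    model: str,
    messages: List[Dict[str, str]],
    reasoning_effort: str = "none",
) -> Dict[str, Any]:
    """Build a /v1/chat/completions body that controls thinking via reasoning_effort."""
    return {
        "model": model,
        "messages": messages,
        # Replaces the ignored "think" flag described in the commit.
        "reasoning_effort": reasoning_effort,
    }


# Figures quoted in the commit message for a 29k-context agent turn.
token_ratio = 4850 / 173  # about 28x fewer completion tokens
time_ratio = 175 / 10.6   # about 16.5x faster
```

The streaming planner would pass `reasoning_effort="medium"` at its call site, while recon and summaries keep the default.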
Adds a comprehensive 'Thinking mode and reasoning_effort' section to
docs/guides/local-llm.md covering:

- Why think=true/false is silently ignored on /v1/chat/completions
- reasoning_effort="none" as the correct parameter (Ollama 0.18+)
- Measured token/time data: 4850 tokens/175s → 173 tokens/10.6s at 29k ctx
- Parameter reference table for all reasoning_effort values
- Which code paths use which setting and why
- Verification curl command
- CLI usage for interactive sessions
- Ollama version requirement and upgrade instructions

Also updates outdated config defaults throughout the doc:
LOCAL_LLM_COMPLETION_TOKEN_CEILING 8192→32768,
LOCAL_LLM_MAX_CONTEXT_CHARS 24000→60000,
LOCAL_LLM_MAX_SYSTEM_CHARS 12000→20000.
…docs

docs: document reasoning_effort=none for Ollama thinking mode control
…vars

- LOCAL_LLM_COMPLETION_TOKEN_CEILING: 16384 → 32768 (matches config.py/docker-compose)
- Update reasoning token comment: reasoning_effort=none eliminates CoT tokens entirely
- Fix example model: qwen3:30b-a3b → qwen3.5:35b-a3b-q4_K_M (current recommended)
- Add USE_LOCAL_LLM documentation (was in docker-compose but missing from example)
- Add HOST_REPO_DIR documentation
- Add multi-model override vars (LOCAL_LLM_BASE_URL_PLAN, MODEL_PLAN, BASE_URL_AGENT, MODEL_AGENT)
- Rewrite section headers and comments for clarity throughout
- Group WORKTREE_INDEX_ENABLED and AC_MIN_TURN_DELAY_SECS with better descriptions

Also fix .env (local, gitignored):
- LOCAL_LLM_MAX_CONTEXT_CHARS: 24000 → 60000 (matches docker-compose default)
- Remove stale mlx-openai-server comment
fix: sync .env.example with current config defaults and document all vars
The reviewer's initial message already contains the full briefing header
(OWNER, REPO, BRANCH, PR_NUMBER, ISSUE_NUMBER) plus the pre-loaded review
context (git diff, mypy, pytest, issue). Instructing the agent to call
task/briefing returns the exact same content and wastes one full turn.

Replace 'All task context is in your task/briefing MCP prompt. Read it once'
with an explicit instruction NOT to call any MCP prompt tool, and point the
agent to the briefing header fields already visible in the initial message.
…-call

fix: remove task/briefing MCP call from reviewer role prompt
…wer prohibition

The _RUNTIME_ENV_NOTE was telling every agent — including the reviewer — to
read ac://runs/{run_id}/context for full task context. This directive belongs
at the briefing layer (where the developer gets it when the issue body is not
directly injected), not in the runtime environment note which is about
Docker/git/tools mechanics.

Removing it from _RUNTIME_ENV_NOTE is the correct architectural fix: the
developer still gets directed to the resource via the briefing itself when
needed. The reviewer never needs it — the warmup pre-loads diff, mypy, pytest,
and the issue into messages[0] before the loop starts.

Also removes the now-unnecessary "do not call task/briefing" prohibition from
the reviewer role template. The prohibition was fighting the old instruction
that said to call it; with both the instruction and the runtime directive gone,
there is no ghost left to fight. Less prompt = less confusion.
fix: remove ac://runs context directive from runtime note; trim reviewer prohibition
The org chart's Launch button always called POST /api/dispatch/label.
That endpoint created a fresh worktree from dev and set task.branch to
the throw-away label branch — not the implementer's PR branch — so the
reviewer warmup computed a zero-diff against dev and the agent started
with no context.

Comprehensive fix across three layers:

1. agentception/readers/github.py — add find_pr_for_issue(issue_number,
   repo) that locates the open PR by branch-name pattern (agent/issue-N)
   first, then by closing keywords (closes/fixes/resolves #N) in the PR
   body. Returns None on API failure rather than raising.

2. agentception/routes/api/dispatch.py — detect role=="reviewer" +
   scope=="issue" early in dispatch_label_agent and delegate to the new
   _dispatch_label_reviewer helper.  The helper:
   - looks up the PR via find_pr_for_issue (422 if none found)
   - fetches the remote PR branch (422 if branch deleted)
   - creates the worktree anchored to origin/<pr_branch> (409 if exists)
   - sets run_id=review-{pr_number}, branch=<pr_branch> in the DB so
     agent_loop's _run_reviewer_warmup checks out the right code
   - passes pr_number to persist_agent_run_dispatch
   All other roles/scopes continue through the existing label path.

3. agentception/services/agent_loop.py — fix the reviewer warmup fallback
   from the deprecated feat/issue-{N} to agent/issue-{N}, and emit a
   structured WARNING when task.branch is missing so mis-configured
   dispatches are immediately visible in logs.

12 new tests cover every outcome: PR found by branch, PR found by body
keyword, all closing-keyword variants, branch-over-body priority, no PR
found, branch deleted, worktree exists, developer routes unchanged, and
the warmup fallback warning.
fix: route reviewer org-chart dispatch to PR branch, not dev
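The lookup order in `find_pr_for_issue` (branch name first, then closing keywords) can be sketched without any GitHub calls. The PR dictionaries, their `head_ref`/`body` keys, the exact-match branch test, and the `pick_pr_for_issue` name are all assumptions for illustration.

```python
import re
from typing import Any, Dict, List, Optional


def pick_pr_for_issue(
    issue_number: int, open_prs: List[Dict[str, Any]]
) -> Optional[Dict[str, Any]]:
    """Choose the open PR for an issue: branch match wins over body keywords."""
    branch = f"agent/issue-{issue_number}"
    for pr in open_prs:
        if pr.get("head_ref") == branch:
            return pr
    # Trailing \b stops "#4" from matching inside "#42".
    keyword = re.compile(
        rf"\b(?:closes|fixes|resolves)\s+#{issue_number}\b", re.IGNORECASE
    )
    for pr in open_prs:
        if keyword.search(pr.get("body") or ""):
            return pr
    return None
```

Scanning all PRs for a branch match before looking at any body is what gives the branch-over-body priority covered by the tests listed above.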
The UI slug 'pr-reviewer' never matched any backend check (role == 'reviewer'),
so reviewer agents were dispatched as plain label workers with no warmup,
no reviewer tool surface, and no PR branch anchoring.

Renames every occurrence — the role is reviewer, not pr-reviewer:
- org_designer.ts: slug, qa-coordinator rules, WORKER_RULES
- agent.html: org-node emoji/label conditionals
- transcripts.html, transcript_detail.html: badge colour maps

Rebuilds app.js bundle.
…iewer

fix: rename pr-reviewer → reviewer everywhere
Co-authored-by: AgentCeption Bot <agent@agentception.io>
Co-authored-by: AgentCeption Bot <agent@agentception.io>
cgcardona merged commit bf97c77 into main on Mar 16, 2026
1 check passed