Feature/crater agent by MOONSakura0614 · Pull Request #414 · raids-lab/crater

MOONSakura0614 · 2026-06-07T02:32:10Z

No description provided.

# Conflicts: # backend/cmd/gorm-gen/models/migrate.go # backend/docs/docs.go # backend/docs/swagger.json # backend/docs/swagger.yaml

k8s_get_events, k8s_describe_resource, k8s_get_pod_logs removed from ADMIN_ONLY_TOOL_NAMES. Ownership scoping added in follow-up tasks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

k8s_get_events, k8s_describe_resource, k8s_get_pod_logs now included in agentUserTools so non-admin users receive these tool schemas.

GET /api/agent/k8s-ownership?pod_name=X&user_id=Y checks whether the pod belongs to a job owned by the given user. Used by Python local executor to scope k8s read tools for non-admin users.

job.Name is display name; job.JobName is the k8s resource name that pod names are derived from (e.g. {jobName}-worker-0). Fix ownership check to use the correct field.
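
A minimal sketch of why the field choice matters, assuming pod names take the form `{jobName}-<role>-<index>` as in the example above. The `Job` dataclass, its sample values, and `pod_belongs_to_job` are hypothetical names for illustration; the real check lives in the Go backend.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str      # display name shown in the UI
    job_name: str  # Kubernetes resource name that pod names derive from
    user_id: int


def pod_belongs_to_job(pod_name: str, job: Job) -> bool:
    """Return True if pod_name equals the job's resource name or starts with '<job_name>-'."""
    if not job.job_name:
        return False
    return pod_name == job.job_name or pod_name.startswith(job.job_name + "-")


job = Job(name="My training run", job_name="sg-train-abc123", user_id=7)
print(pod_belongs_to_job("sg-train-abc123-worker-0", job))  # True: matches job_name
print("sg-train-abc123-worker-0".startswith(job.name))      # False: the display name never matches
```

Matching on `job_name + "-"` rather than the bare name avoids treating `sg-train-abc1234-worker-0` as a pod of `sg-train-abc123`.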

k8s_get_events, k8s_describe_resource, k8s_get_pod_logs now verify pod ownership via Go backend /api/agent/k8s-ownership before executing kubectl for non-admin users. Admin users bypass the check.
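
The gating described above can be sketched roughly as follows. This is an illustration under assumptions, not the executor's actual code: `may_run_k8s_read_tool` and the injected `check_ownership` callable are hypothetical, and the callable stands in for the HTTP request to `/api/agent/k8s-ownership`.

```python
from typing import Callable


def may_run_k8s_read_tool(
    is_admin: bool,
    user_id: int,
    pod_name: str,
    check_ownership: Callable[[str, int], bool],
) -> bool:
    """Admins bypass the check; everyone else must own the pod."""
    if is_admin:
        return True
    return check_ownership(pod_name, user_id)


# Stand-in for the backend call, backed by hypothetical data.
owned = {("train-1-worker-0", 7)}


def fake_check(pod_name: str, user_id: int) -> bool:
    return (pod_name, user_id) in owned


print(may_run_k8s_read_tool(False, 7, "train-1-worker-0", fake_check))  # True
print(may_run_k8s_read_tool(False, 8, "train-1-worker-0", fake_check))  # False
print(may_run_k8s_read_tool(True, 8, "train-1-worker-0", fake_check))   # True
```

Injecting the ownership check keeps the decision testable without a running backend.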

Tool schemas are delivered to LLM via bind_tools() API; listing all enabled_tools by name in prompt text was noisy and had no effect on tool selection. Confirm-only tools remain listed as behavioral hint.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Explicitly return 400 for user_id=0 to surface misconfiguration instead of silently returning allowed=false from a no-match DB query.
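
The validation order can be sketched as below. This is a Python illustration of the decision logic only (the real handler is in the Go backend); the function name, the injected `owns_pod` callable, and the error body are assumptions, while the `allowed` key follows the wording above.

```python
from typing import Callable, Dict, Tuple


def k8s_ownership_response(
    pod_name: str,
    user_id: int,
    owns_pod: Callable[[str, int], bool],
) -> Tuple[int, Dict[str, object]]:
    """Return (HTTP status, JSON body) for an ownership query."""
    # Reject the zero value first: it signals a caller that never set user_id,
    # which a plain DB lookup would silently report as "not allowed".
    if user_id == 0:
        return 400, {"error": "user_id must be non-zero"}
    return 200, {"allowed": owns_pod(pod_name, user_id)}


def nobody_owns(pod_name: str, user_id: int) -> bool:
    return False


print(k8s_ownership_response("train-1-worker-0", 0, nobody_owns)[0])  # 400
print(k8s_ownership_response("train-1-worker-0", 7, nobody_owns))     # (200, {'allowed': False})
```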

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…leMetadata

Fixes the HTTP 500 that fired whenever the detail page tried to search by session UUID, and the several downstream UI issues reported:

- agent_audit.go: cast s.session_id to text in the ILIKE branch of the keyword base query. Searching with a UUID string used to crash with "operator does not exist: uuid ~~* unknown (SQLSTATE 42883)".
- Add GetSessionDetail(sessionID) service method + GET /admin/agent/sessions/:sessionId/detail handler. Returns the same enriched AgentAuditSessionListItem the list query builds, so the detail page can fetch one session by ID instead of keyword-searching through a paginated list. Route wired in RegisterAdmin.
- Aggregate orchestration modes per session via STRING_AGG(DISTINCT orchestration_mode) over agent_turns, split to []string in Go. AgentAuditSessionListItem gains OrchestrationModes field so the detail page can show both single_agent and multi_agent when a conversation used both across turns. last_orchestration_mode is kept for compatibility.
- Refactor the list+detail query into buildAgentAuditSessionEnrichedQuery so both code paths share the same SELECT and LEFT JOINs.
- vcjob/agent_submit.go: feed the agent submitter with the full VolcanojobMgr service set (configService / queueQuotaSvc / prequeueWatcher / billingService). Both SubmitJupyterJob and SubmitTrainingJob now call req.validateScheduleOptions + mgr.resolveJobScheduleMetadata and pass the resolved *jobScheduleMetadata to getLabelAndAnnotations, matching the regular /v1/vcjobs/* path so agent-created jobs carry the same prequeue / backfill / tolerance annotations as user-created ones.
- Regenerate backend/docs/{docs.go,swagger.json,swagger.yaml} via 'make docs' so the swagger now reflects this branch's endpoints (trigger-eval, quality-evals admin list, sessions/:id/detail) instead of inheriting main's sloppy copy.
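
As an illustration of the mode aggregation step, here is a minimal Python sketch of splitting a STRING_AGG result into a list. The actual split happens in Go; the function name and the comma separator are assumptions, since the message does not state the delimiter.

```python
from typing import List, Optional


def split_orchestration_modes(aggregated: Optional[str], sep: str = ",") -> List[str]:
    """Split an aggregated mode string into a list of non-empty modes.

    STRING_AGG yields NULL (None here) when a session has no turns,
    so that case maps to an empty list.
    """
    if not aggregated:
        return []
    return [part.strip() for part in aggregated.split(sep) if part.strip()]


print(split_orchestration_modes("multi_agent,single_agent"))  # ['multi_agent', 'single_agent']
print(split_orchestration_modes(None))                        # []
```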

… left

Front-end follow-up to the backend uuid/detail/modes fixes.

- services/api/admin/agentAudit.ts:
  * add apiAdminGetAgentAuditSessionDetail (GET /admin/agent/sessions/:id/detail) so the detail page fetches a single session directly instead of keyword-searching through the list API with the full UUID (which hit the ILIKE-on-uuid bug).
  * add AgentAuditSessionListItem.orchestrationModes so the detail page can show every mode used across turns.
- routes/admin/more/agent-audit/$sessionId.tsx:
  * switch the session-metadata query to the new detail endpoint.
  * drop the top-left "返回列表" ("Back to list") button (users still have the nav breadcrumb + browser back; admin consoles don't need a second one).
  * render OrchestrationModes as a badge list (single_agent + multi_agent both appear when a session mixed modes across turns).
  * compute message / tool-call / turn counts from the actual fetched arrays as a fallback when the enriched JOIN is stale or zero.
  * treat the detail query's error as "not found" so non-chat sessions that previously 500'd now fall back cleanly.
- routes/admin/more/agent-audit/index.tsx: drop the mx-auto from the feedback column's thumbs-up/down icons so the icon sits left-aligned in its table cell, matching where the fallback dash starts. Before, icons were centered but "-" was left-aligned, so the two kinds of rows visually jumped.

The AgentQualityEval table types turn_id as uuid, and PostgreSQL rejects empty-string uuid literals ('22P02: invalid input syntax for type uuid: ""'). That made POST /admin/agent/sessions/:id/trigger-eval and the feedback-driven path 500 whenever the caller did not provide a turn ID (session-level eval from the admin audit panel, and thumbs-up/down on the conversation as a whole).

CreateQualityEval now detects eval.TurnID == "" and uses tx.Omit("TurnID") to exclude the column from the INSERT, letting the DB default (NULL) apply. The GORM-returned *model.AgentQualityEval still has its ID populated, so the downstream goroutine can still forward { eval_id, session_id, turn_id: "" } to crater-agent's /eval/quality/session endpoint.
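
The effect of omitting the column can be sketched in Python as follows. This only illustrates the idea behind `tx.Omit("TurnID")`, not GORM itself; `build_insert`, the table name `agent_quality_evals`, and the `%s` placeholder style are assumptions for the example.

```python
from typing import Dict, List, Tuple


def build_insert(
    table: str,
    row: Dict[str, object],
    omit_if_empty: Tuple[str, ...] = (),
) -> Tuple[str, List[object]]:
    """Build a parameterized INSERT, skipping listed columns whose value is ''.

    A skipped column is absent from the statement, so the database
    default (NULL for a nullable uuid column) applies.
    """
    cols = [c for c, v in row.items() if not (c in omit_if_empty and v == "")]
    placeholders = ", ".join(["%s"] * len(cols))
    sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
    return sql, [row[c] for c in cols]


sql, params = build_insert(
    "agent_quality_evals",
    {"session_id": "s-1", "turn_id": ""},
    omit_if_empty=("turn_id",),
)
print(sql)     # INSERT INTO agent_quality_evals (session_id) VALUES (%s)
print(params)  # ['s-1']
```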

…rds drive filter

Detail page rebuild
-------------------

Replace the ad-hoc "big white card + tabs" layout with @/components/layout/detail-page (the same DetailPage abstraction that jobs/detail uses). The route now wires:

- header: PageTitle title + copyable session id + source badge
- info: 10 icon-label-value rows in the 2/4-col grid (owner, account, orchestration modes, eval status, msgs/tools/turns counts with colour-coded numbers, feedback direction, created/updated times via TimeDistance)
- tabs: TabsList with underline/icon style: 对话时间线 (conversation timeline) / 质量评测 (quality evaluation) / 用户反馈 (user feedback) / 工具调用 (tool calls) / 执行轮次 (execution turns)
- currentTab/setCurrentTab: URL search param ?tab=... via detailLinkOptions (matches jobs pattern; tab state survives refreshes)

Also adds validateSearch + loader({ params }) => ({ crumb: short id }) so the app breadcrumb now shows agent-audit -> <short-id> and the top-left path navigation replaces the ad-hoc "返回列表" ("Back to list") button that was removed earlier.

List page overview ↔ filter bar wiring
--------------------------------------

The 4 source summary cards at the top are now fully interactive and are the canonical source filter; source is dropped from the toolbar's faceted filters to avoid duplication. Click a card to filter the DataTable below to that source; click again to clear. Each card gains an accent colour (blue/amber/emerald/fuchsia) tied to the source, plus a coloured icon chip, so they read as filter buttons rather than static stats.

Implementation-wise, we use two useQuery hooks with the same queryKey (one transforms to summary, one to items via source filter in select). TanStack Query dedupes the fetch and lets React re-render only the items path when sourceFilter changes.

Number column colour
--------------------

msgs / tools / turns counts in the list (and the corresponding info rows on the detail page) are now coloured sky / orange / violet so the eye can parse them at a glance instead of blending into everything else.

Add tool-loop stop-signal detection (k8s top, diagnose_job, capacity coverage), diagnostic evidence request matching, and follow-up prompt tightening across coordinator/planner/explorer/executor/verifier and intent router. Plan-execute orchestrator follows the same prompt profile split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Relocates the 智能运维 (AIOps) navigation from primary groups into the 更多 (More) menu on both admin and portal sidebars, and adds the missing crumb loader so the AIOps page now renders its breadcrumb correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- llm-clients.json: register glm-4.5 and qvq-max-2025-03-25 entries
- .gitignore: ignore local generated artifacts (output/, nested crater-agent/crater-agent/, .drawio.bkp, .venv/, __pycache__/)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bring in offline thesis materials so the repo carries the same references the agent work is being designed against:

- Chapter3.md, Chapter4.md: rewrite drafts in repo root
- docs/zh-CN/thesis-mops-ch3-ch4-rewrite.{md (already tracked),pdf}
- docs/zh/, docs/0511/, docs/thesis-v2/: prior single-chapter and multi-version drafts kept for cross-reference
- crater-agent/docs0511/: per-chapter draft set used while reshaping the agent service
- docs-figure/: drawio sources and PNG exports for ch1-ch4 figures
- mops_arch.drawio[.png], system_arch.drawio[.png], mops-architecture.drawio: top-level architecture diagrams

Also extend .gitignore for drawio editor temp files (*.drawio.dtmp, .$*.drawio.dtmp) that the editor leaves behind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Backend:

- New PUT /api/v1/agent/sessions/:sessionId/title endpoint
- AgentSessionTitleRequest type and UpdateSessionTitle handler with owner check (mirrors UpdateSessionPin pattern)
- Service.UpdateSessionTitle now trims whitespace, rejects empty input, truncates to 100 runes, and uses UpdateColumn so the session's updated_at timestamp is NOT bumped on rename; renames do not count as new activity for sort order.

Frontend:

- apiRenameSession(sessionId, title) in services/api/agent.ts
- AIChatDrawer session list item:
  - New Edit (Pencil) icon next to Pin, only visible on group-hover
  - Click → title cell turns into inline Input prefilled with current title
  - Enter or blur commits via apiRenameSession; Esc cancels
  - Loading spinner during save; toast on failure
  - Empty / unchanged input drops back to view mode without API call
  - 100-char client-side maxLength matches backend truncation

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
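
The title normalization rules (trim, reject empty, truncate to 100 runes) can be sketched in Python as below. This is an illustration of the rules as described, not the Go service code; the function and constant names are hypothetical. Python slices strings by Unicode code point, which corresponds to truncating by runes in Go.

```python
MAX_TITLE_RUNES = 100


def normalize_session_title(raw: str) -> str:
    """Trim whitespace, reject empty titles, and cap the length at 100 code points."""
    title = raw.strip()
    if not title:
        raise ValueError("title must not be empty")
    return title[:MAX_TITLE_RUNES]


print(normalize_session_title("  GPU job triage  "))  # GPU job triage
print(len(normalize_session_title("运" * 150)))        # 100
```

Counting code points rather than bytes keeps multi-byte titles, such as Chinese ones, from being cut mid-character.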

Carry forward the .gitignore rules added on the demo branch: mindmap.drawio files, the zhisuan-platform-ops-challenges brainstorm diagrams, and several figure-source rewrites that stay in the working tree only. They're personal notes, not part of the agent codebase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The success-analysis `resource_efficiency` block was wrong on all three dimensions because it treated every field as a 0-100 percentage when in fact the backend's Prometheus queries (backend/pkg/monitor/query.go) emit different units:

- gpu_util_avg = 0.0-1.0 ratio (DCGM_FI_DEV_GPU_UTIL / 100 server-side)
- cpu_usage_avg = CPU cores absolute (irate of cpu_usage_seconds_total)
- mem_usage_avg = MiB absolute (container_memory_usage_bytes / 1Mi)

Previously each value was divided by 100 again, producing:

- GPU 利用率 (GPU utilization) 0.2% (real: 20%)
- 平均内存利用率 (average memory utilization) 36189.3% (36189 MiB / 100; real: ~110% of a 32Gi request)
- CPU 7.8% looked plausible only by coincidence (7.8 cores out of 8)

Fixes:

1. New _parse_cpu_cores / _parse_memory_mib helpers parse Kubernetes resource.Quantity strings (Ki/Mi/Gi/Ti binary + K/M/G/T/P decimal + 500m for CPU + raw bytes).
2. New _normalize_gpu_util_ratio collapses both ratio (0-1) and legacy percent (0-100) inputs to a clamped 0..1 ratio.
3. _build_success_analysis now:
   - Uses GPU util as-is (already a ratio).
   - Derives CPU ratio = cpu_used_cores / parse(requested.cpu).
   - Derives memory ratio = mem_used_mib / parse(requested.memory).
   - Skips samples without a declared request (no 0/0 averaging).
4. _estimate_over_provisioned_count compares gpu_util against 0.40 (ratio) instead of 40.0; the old threshold was always true and over-counted every multi-GPU job.
5. _build_analysis_prompt renders proper units to the LLM: GPU as percentage (ratio*100%), CPU as "used/requested cores", memory as "used/requested MiB", with no more misleading "%" suffixes on raw cores/MiB values.

Verified with realistic fixtures: a job using 7.8/8 cores, 36189/32768 MiB, 0.2 GPU ratio now reports 97.5% / ~110% / 20% respectively, and the over-provisioned detector flags it (was over-counting before).

Related Go-side bugs (out of scope for this commit):

- tools_readonly.go:1317 estimateActualGPUUsage divides gpu_util by 100 a second time, leaving gpu_actual_used near 0 for most jobs.
- tools_readonly.go:1555-1557, :2174, :2200 format 0-1 GPU ratios with "%.1f%%" producing "0.2%" instead of "20.0%".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
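
The unit handling described above can be sketched as follows. This is a simplified illustration, not the commit's actual helpers: the function names mirror the ones mentioned but are written from scratch, the suffix tables cover only the units listed (plus lowercase `k`, which is how Kubernetes writes decimal kilo), and treating any value above 1.0 as a legacy percentage is an assumption.

```python
import re
from typing import Optional

_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15}
_QUANTITY = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]*)\s*$")


def parse_cpu_cores(quantity: Optional[str]) -> Optional[float]:
    """Parse a CPU quantity such as '8' or '500m' into cores."""
    m = _QUANTITY.match(quantity or "")
    if not m:
        return None
    value, suffix = float(m.group(1)), m.group(2)
    if suffix == "m":  # millicores
        return value / 1000.0
    return value if suffix == "" else None


def parse_memory_mib(quantity: Optional[str]) -> Optional[float]:
    """Parse a memory quantity such as '32Gi', '512M', or raw bytes into MiB."""
    m = _QUANTITY.match(quantity or "")
    if not m:
        return None
    value, suffix = float(m.group(1)), m.group(2)
    if suffix == "":
        factor = 1  # raw bytes
    elif suffix in _BINARY:
        factor = _BINARY[suffix]
    elif suffix in _DECIMAL:
        factor = _DECIMAL[suffix]
    else:
        return None
    return value * factor / 2**20


def normalize_gpu_util_ratio(value: float) -> float:
    """Collapse ratio (0-1) or legacy percent (0-100) input to a clamped 0..1 ratio.

    Assumption: values above 1.0 are percentages, so a legacy 1.0 (1%) is ambiguous.
    """
    ratio = value / 100.0 if value > 1.0 else value
    return max(0.0, min(1.0, ratio))


def usage_ratio(used: float, requested: Optional[float]) -> Optional[float]:
    """Return used/requested, or None when no request was declared."""
    if not requested:
        return None
    return used / requested


cpu = usage_ratio(7.8, parse_cpu_cores("8"))
mem = usage_ratio(36189, parse_memory_mib("32Gi"))
gpu = normalize_gpu_util_ratio(0.2)
print(f"{cpu:.1%} {mem:.1%} {gpu:.1%}")  # 97.5% 110.4% 20.0%
```

Returning `None` for a missing or unparseable request lets the caller skip the sample instead of averaging in a 0/0 value.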

Cover docs-figure/ch3/fig3-3-mops-layered-architecture.drawio (and its png export) so the untracked file no longer shows up in git status. Also note: 5 tracked drawio files (ch1 figures, ch2 cloud-native stack, ch3 mas-roles, ch5 dataset-distribution) have local edits marked as skip-worktree to silence them locally. They're personal thesis work, not part of the agent codebase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Conflicts: # backend/cmd/gorm-gen/models/migrate.go # frontend/src/components/layout/detail-page.tsx # frontend/src/components/layout/page-title.tsx # frontend/src/i18n/locales/enUS/translation.json # frontend/src/i18n/locales/ja/translation.json # frontend/src/i18n/locales/ko/translation.json

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

MOONSakura0614 and others added 30 commits March 15, 2026 17:51

feat: implement AI operations

f861710

fix: lint

365315b

Merge branch 'main' into feature_simple_chatbot

16684b0

fix: lint

1c19f0b

fix copilot review

abbaea1

update swagger

aa72558

feat: add floating assistant button and health overview components

e863a32

fix: enhance error handling with localized messages and response keys

18083c4

fix rule

6804acf

feat: implement agent session management and tool execution confirmation

9b30867

feat(agent): wire chat timeline and confirmation flow

334416f

feat: integrate crater agent runtime and ops report flow

0c1f2f5

fix

a0326fb

Merge remote-tracking branch 'origin/main' into feature/crater-agent

e3b2381


refactor: simplify crater agent llm config

775d13a

refactor: centralize crater agent llm config

8b5e192

Refine agent orchestration and runtime flow

c075325

fix: agent final response quality and confirmation spinner state

05c06ae

feat(agent-v0):0416

3c1181a

feat(agent): open k8s read tools to non-admin users

2538c90


test(agent): add k8s_get_pod_logs to NEW_READ_ONLY_TOOLS coverage

8e1c600

feat(agent): move k8s read tools to user capabilities

bb11237


feat(agent): add k8s-ownership internal check endpoint

f0d4608


fix(agent): use JobName (k8s resource name) for pod ownership check

c71dc8e


config: enable webSearch with ops-relevant domain allowlist

2e63087

feat(agent): add ownership scoping for user-accessible k8s tools

2f8de33


fix(agent): reject user_id=0 in k8s-ownership check

daef0d1


chore(deps): add camel-ai and duckduckgo-search

2e20ebf

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(tools): add CAMEL web search and code execution adapters

10cced6

MOONSakura0614 and others added 23 commits April 23, 2026 20:37

chore: gitignore docs/thesis/ so thesis drafts stay out of VCS

63bec79

feat: refresh agent workflows and evaluation plumbing

dc81562

fix agent confirmation resume context

7baf4a8

feat: commit core agent changes

50e916c

chore(agent): preserve offline experiment config

8f7456c

chore(agent): save local experiment support changes

4bbc9c7

chore: ignore ch2 llm-agent-overview thesis figure

299ebe0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix

34db704

Clean agent handoff implementation

609cb7a

chore: satisfy hooks after agent handoff merge

c8f12a2

Copilot AI review requested due to automatic review settings June 7, 2026 02:32

Copilot AI reviewed Jun 7, 2026

MOONSakura0614 added 5 commits June 7, 2026 12:16

chore: restore frontend mocks and remove agent docs

a14a288

chore: tighten agent handoff diff

18c76cb

docs: add crater agent handoff guide

866ebfe

docs: clarify crater agent runtime setup

216e7a2

fix: satisfy backend goconst lint

e2b2442

Feature/crater agent #414
MOONSakura0614 wants to merge 88 commits into raids-lab:main from MOONSakura0614:feature/crater-agent

2 participants
