
Feature/crater agent #414

Open
MOONSakura0614 wants to merge 88 commits into raids-lab:main from MOONSakura0614:feature/crater-agent

Conversation

@MOONSakura0614

No description provided.

MOONSakura0614 and others added 30 commits March 15, 2026 17:51
# Conflicts:
#	backend/cmd/gorm-gen/models/migrate.go
#	backend/docs/docs.go
#	backend/docs/swagger.json
#	backend/docs/swagger.yaml
k8s_get_events, k8s_describe_resource, k8s_get_pod_logs removed from
ADMIN_ONLY_TOOL_NAMES. Ownership scoping added in follow-up tasks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
k8s_get_events, k8s_describe_resource, k8s_get_pod_logs are now included
in agentUserTools so non-admin users receive these tool schemas.

GET /api/agent/k8s-ownership?pod_name=X&user_id=Y checks whether the pod
belongs to a job owned by the given user. It is used by the Python local
executor to scope k8s read tools for non-admin users.

job.Name is the display name; job.JobName is the k8s resource name that
pod names are derived from (e.g. {jobName}-worker-0). Fix the ownership
check to use the correct field.

k8s_get_events, k8s_describe_resource, k8s_get_pod_logs now verify
pod ownership via the Go backend's /api/agent/k8s-ownership before
executing kubectl for non-admin users. Admin users bypass the check.
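
For reviewers, the gating described above could look roughly like the sketch below. It is not the code in this branch: `may_run_k8s_tool`, `fake_backend`, the prefix match on the job name, and the sample data are all hypothetical, and the real check is an HTTP call to the Go backend rather than an in-process function.

```python
from typing import Callable

# Tool names taken from the commit messages above.
OWNERSHIP_SCOPED_TOOLS = {"k8s_get_events", "k8s_describe_resource", "k8s_get_pod_logs"}


def may_run_k8s_tool(
    tool_name: str,
    pod_name: str,
    user_id: int,
    is_admin: bool,
    check_ownership: Callable[[str, int], bool],
) -> bool:
    """Return True if the tool call may proceed to kubectl."""
    if tool_name not in OWNERSHIP_SCOPED_TOOLS:
        return True  # assumption: this sketch only gates the three tools above
    if is_admin:
        return True  # admin users bypass the ownership check
    if user_id <= 0:
        return False  # assumption: treat a missing user id as "not allowed"
    return check_ownership(pod_name, user_id)


def fake_backend(pod_name: str, user_id: int) -> bool:
    """Hypothetical stand-in for GET /api/agent/k8s-ownership."""
    # user_id -> k8s resource names (job.JobName, not the display name job.Name)
    jobs_by_user = {7: ["train-abc"]}
    # Assumption: pod names start with "{jobName}-", e.g. "{jobName}-worker-0".
    return any(pod_name.startswith(f"{job}-") for job in jobs_by_user.get(user_id, []))


owner_allowed = may_run_k8s_tool("k8s_get_pod_logs", "train-abc-worker-0", 7, False, fake_backend)
other_allowed = may_run_k8s_tool("k8s_get_pod_logs", "train-abc-worker-0", 8, False, fake_backend)
admin_allowed = may_run_k8s_tool("k8s_get_pod_logs", "train-abc-worker-0", 8, True, fake_backend)
```

Passing the ownership check in as a callable keeps the gate testable without a running backend.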
Tool schemas are delivered to LLM via bind_tools() API; listing all
enabled_tools by name in prompt text was noisy and had no effect on
tool selection. Confirm-only tools remain listed as behavioral hint.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Explicitly return 400 for user_id=0 to surface misconfiguration
instead of silently returning allowed=false from a no-match DB query.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MOONSakura0614 and others added 23 commits April 23, 2026 20:37
…leMetadata

Fixes the HTTP 500 that fired whenever the detail page tried to search by
session UUID and the several downstream UI issues reported:

- agent_audit.go: cast s.session_id to text in the ILIKE branch of the
  keyword base query. Searching with a UUID string used to crash with
  "operator does not exist: uuid ~~* unknown (SQLSTATE 42883)".

- Add GetSessionDetail(sessionID) service method + GET /admin/agent/
  sessions/:sessionId/detail handler. Returns the same enriched
  AgentAuditSessionListItem the list query builds, so the detail page
  can fetch one session by ID instead of keyword-searching through a
  paginated list. Route wired in RegisterAdmin.

- Aggregate orchestration modes per session via STRING_AGG(DISTINCT
  orchestration_mode) over agent_turns, split to []string in Go.
  AgentAuditSessionListItem gains OrchestrationModes field so the
  detail page can show both single_agent and multi_agent when a
  conversation used both across turns. last_orchestration_mode is kept
  for compatibility.

- Refactor the list+detail query into buildAgentAuditSessionEnrichedQuery
  so both code paths share the same SELECT and LEFT JOINs.

- vcjob/agent_submit.go: feed the agent submitter with the full
  VolcanojobMgr service set (configService / queueQuotaSvc /
  prequeueWatcher / billingService). Both SubmitJupyterJob and
  SubmitTrainingJob now call req.validateScheduleOptions +
  mgr.resolveJobScheduleMetadata and pass the resolved
  *jobScheduleMetadata to getLabelAndAnnotations, matching the regular
  /v1/vcjobs/* path so agent-created jobs carry the same prequeue /
  backfill / tolerance annotations as user-created ones.

- Regenerate backend/docs/{docs.go,swagger.json,swagger.yaml} via
  'make docs' so the swagger now reflects this branch's endpoints
  (trigger-eval, quality-evals admin list, sessions/:id/detail) instead
  of inheriting main's sloppy copy.
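
As an illustration of the "aggregate, then split" step for orchestration modes, here is a minimal Python sketch. The actual split happens in Go; the comma delimiter and the function name `split_orchestration_modes` are assumptions, since the commit message does not state the separator passed to STRING_AGG.

```python
from typing import List, Optional


def split_orchestration_modes(aggregated: Optional[str], sep: str = ",") -> List[str]:
    """Split a STRING_AGG result into a de-duplicated list of modes.

    A NULL aggregate (session without turns) becomes an empty list.
    """
    if not aggregated:
        return []
    modes: List[str] = []
    for part in aggregated.split(sep):
        mode = part.strip()
        if mode and mode not in modes:  # keep first-seen order, drop blanks
            modes.append(mode)
    return modes


mixed = split_orchestration_modes("single_agent,multi_agent")
empty = split_orchestration_modes(None)
```

Returning an empty list rather than `[""]` for a NULL aggregate lets the detail page render no badges instead of one blank badge.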
… left

Front-end follow-up to the backend uuid/detail/modes fixes.

- services/api/admin/agentAudit.ts:
  * add apiAdminGetAgentAuditSessionDetail (GET /admin/agent/
    sessions/:id/detail) so the detail page fetches a single session
    directly instead of keyword-searching through the list API with
    the full UUID (which hit the ILIKE-on-uuid bug).
  * add AgentAuditSessionListItem.orchestrationModes so the detail
    page can show every mode used across turns.

- routes/admin/more/agent-audit/$sessionId.tsx:
  * switch the session-metadata query to the new detail endpoint.
  * drop the top-left "返回列表" ("Back to list") button (users still
    have the nav breadcrumb + browser back; admin consoles don't need a
    second one).
  * render OrchestrationModes as a badge list (single_agent + multi_agent
    both appear when a session mixed modes across turns).
  * compute message / tool-call / turn counts from the actual fetched
    arrays as a fallback when the enriched JOIN is stale or zero.
  * treat the detail query's error as "not found" so non-chat sessions
    that previously 500'd now fall back cleanly.

- routes/admin/more/agent-audit/index.tsx: drop the mx-auto from the
  feedback column's thumbs-up/down icons so the icon sits left-aligned
  in its table cell, matching where the fallback dash starts. Before,
  icons were centered but "-" was left-start and the two kinds of rows
  visually jumped.
The AgentQualityEval table types turn_id as uuid, and PostgreSQL rejects
empty-string uuid literals ('22P02: invalid input syntax for type uuid:
""'). That made POST /admin/agent/sessions/:id/trigger-eval and the
feedback-driven path 500 whenever the caller did not provide a turn ID
(session-level eval from the admin audit panel, and thumbs-up/down on
the conversation as a whole).

CreateQualityEval now detects eval.TurnID == "" and uses tx.Omit("TurnID")
to exclude the column from the INSERT, letting the DB default (NULL)
apply. The GORM-returned *model.AgentQualityEval still has its ID
populated, so the downstream goroutine can still forward
{ eval_id, session_id, turn_id: "" } to crater-agent's
/eval/quality/session endpoint.
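
The "omit the column when the turn ID is empty" idea can be shown outside GORM. The sketch below uses Python's built-in sqlite3 as a stand-in for PostgreSQL, so it only demonstrates the column-omission logic, not the uuid type error itself; the table name and the helper `create_quality_eval` are hypothetical.

```python
import sqlite3
import uuid


def create_quality_eval(conn: sqlite3.Connection, session_id: str, turn_id: str) -> str:
    """Insert an eval row, leaving turn_id out of the INSERT when it is empty."""
    eval_id = str(uuid.uuid4())
    columns = {"id": eval_id, "session_id": session_id}
    if turn_id != "":
        columns["turn_id"] = turn_id  # only sent when the caller provided one
    names = ", ".join(columns)  # fixed keys from this function, not user input
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(
        f"INSERT INTO agent_quality_evals ({names}) VALUES ({placeholders})",
        list(columns.values()),
    )
    return eval_id


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE agent_quality_evals ("
    "id TEXT PRIMARY KEY, session_id TEXT NOT NULL, turn_id TEXT)"
)
eval_id = create_quality_eval(conn, "s-1", "")
row = conn.execute(
    "SELECT turn_id FROM agent_quality_evals WHERE id = ?", (eval_id,)
).fetchone()
```

Here `row` holds NULL for `turn_id` because the column never appeared in the INSERT, while the generated ID is still available to the caller.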
…rds drive filter

Detail page rebuild
-------------------
Replace the ad-hoc "big white card + tabs" layout with
@/components/layout/detail-page (the same DetailPage abstraction that
jobs/detail uses). The route now wires:

- header: PageTitle title + copyable session id + source badge
- info: 10 icon-label-value rows in the 2/4-col grid (owner, account,
  orchestration modes, eval status, msgs/tools/turns counts with
  colour-coded numbers, feedback direction, created/updated times via
  TimeDistance)
- tabs: TabsList with underline/icon style - 对话时间线 (conversation
  timeline) / 质量评测 (quality evaluation) / 用户反馈 (user feedback) /
  工具调用 (tool calls) / 执行轮次 (execution turns)
- currentTab/setCurrentTab: URL search param ?tab=... via detailLinkOptions
  (matches jobs pattern; tab state survives refreshes)

Also adds validateSearch + loader({ params }) => ({ crumb: short id }) so
the app breadcrumb now shows agent-audit -> <short-id> and the top-left
path navigation replaces the ad-hoc "返回列表" ("Back to list") button
that was removed earlier.

List page overview ↔ filter bar wiring
--------------------------------------
The 4 source summary cards at the top are now fully interactive and are
the canonical source filter; source is dropped from the toolbar's
faceted filters to avoid duplication. Click a card to filter the
DataTable below to that source; click again to clear. Each card gains
an accent colour (blue/amber/emerald/fuchsia) tied to the source,
plus a coloured icon chip, so they read as filter buttons rather than
static stats.

Implementation-wise, we use two useQuery hooks with the same queryKey
(one transforms to summary, one to items via source filter in select):
TanStack Query dedupes the fetch and lets React re-render only the
items path when sourceFilter changes.

Number column colour
--------------------
msgs / tools / turns counts in the list (and the corresponding info
rows on the detail page) are now coloured sky / orange / violet so
the eye can parse them at a glance instead of blending into
everything else.
Add tool-loop stop-signal detection (k8s top, diagnose_job, capacity
coverage), diagnostic evidence request matching, and follow-up prompt
tightening across coordinator/planner/explorer/executor/verifier and
intent router. Plan-execute orchestrator follows the same prompt
profile split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Relocates the 智能运维 (AIOps) navigation entry from the primary groups
into the 更多 (More) menu on both admin and portal sidebars, and adds
the missing crumb loader so the AIOps page now renders its breadcrumb
correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- llm-clients.json: register glm-4.5 and qvq-max-2025-03-25 entries
- .gitignore: ignore local generated artifacts (output/, nested
  crater-agent/crater-agent/, .drawio.bkp, .venv/, __pycache__/)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bring in offline thesis materials so the repo carries the same
references the agent work is being designed against:

- Chapter3.md, Chapter4.md: rewrite drafts in repo root
- docs/zh-CN/thesis-mops-ch3-ch4-rewrite.{md (already tracked),pdf}
- docs/zh/, docs/0511/, docs/thesis-v2/: prior single-chapter and
  multi-version drafts kept for cross-reference
- crater-agent/docs0511/: per-chapter draft set used while reshaping
  the agent service
- docs-figure/: drawio sources and PNG exports for ch1-ch4 figures
- mops_arch.drawio[.png], system_arch.drawio[.png],
  mops-architecture.drawio: top-level architecture diagrams

Also extend .gitignore for drawio editor temp files (*.drawio.dtmp,
.$*.drawio.dtmp) that the editor leaves behind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backend:
- New PUT /api/v1/agent/sessions/:sessionId/title endpoint
- AgentSessionTitleRequest type and UpdateSessionTitle handler with
  owner check (mirrors UpdateSessionPin pattern)
- Service.UpdateSessionTitle now trims whitespace, rejects empty
  input, truncates to 100 runes, and uses UpdateColumn so the
  session's updated_at timestamp is NOT bumped on rename — renames
  do not count as new activity for sort order.
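
The title normalization rules listed above (trim, reject empty, truncate to 100 runes) can be sketched in a few lines. This is a Python illustration, not the Go service code; `normalize_session_title` and `MAX_TITLE_RUNES` are hypothetical names. Python string slicing counts code points, which corresponds to Go runes.

```python
MAX_TITLE_RUNES = 100  # matches the 100-rune limit described above


def normalize_session_title(raw: str) -> str:
    """Trim whitespace, reject empty input, and truncate by code point."""
    title = raw.strip()
    if not title:
        raise ValueError("title must not be empty")
    return title[:MAX_TITLE_RUNES]


short = normalize_session_title("  GPU job triage  ")
long_cjk = normalize_session_title("好" * 150)
```

Truncating by code point rather than by byte avoids cutting a multi-byte character in half, which matters for CJK titles.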

Frontend:
- apiRenameSession(sessionId, title) in services/api/agent.ts
- AIChatDrawer session list item:
  - New Edit (Pencil) icon next to Pin, only visible on group-hover
  - Click → title cell turns into inline Input prefilled with current title
  - Enter or blur commits via apiRenameSession; Esc cancels
  - Loading spinner during save; toast on failure
  - Empty / unchanged input drops back to view mode without API call
  - 100-char client-side maxLength matches backend truncation

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Carry forward the .gitignore rules added on the demo branch:
mindmap.drawio files, the zhisuan-platform-ops-challenges
brainstorm diagrams, and several figure-source rewrites that
stay in the working tree only — they're personal notes, not
part of the agent codebase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The success-analysis `resource_efficiency` block was wrong on all three
dimensions because it treated every field as a 0-100 percentage when in
fact the backend's Prometheus queries (backend/pkg/monitor/query.go)
emit different units:

- gpu_util_avg = 0.0-1.0 ratio (DCGM_FI_DEV_GPU_UTIL / 100 server-side)
- cpu_usage_avg = CPU cores absolute (irate of cpu_usage_seconds_total)
- mem_usage_avg = MiB absolute (container_memory_usage_bytes / 1Mi)

Previously each value was divided by 100 again, producing:
- GPU 利用率 (GPU utilization) 0.2% (real: 20%)
- 平均内存利用率 (average memory utilization) 36189.3% (36189 MiB / 100;
  real: ~110% of a 32Gi request)
- CPU 7.8% looked plausible only by coincidence (7.8 cores out of 8)

Fixes:

1. New _parse_cpu_cores / _parse_memory_mib helpers parse Kubernetes
   resource.Quantity strings (Ki/Mi/Gi/Ti binary + K/M/G/T/P decimal +
   500m for CPU + raw bytes).
2. New _normalize_gpu_util_ratio collapses both ratio (0-1) and legacy
   percent (0-100) inputs to a clamped 0..1 ratio.
3. _build_success_analysis now:
   - Uses GPU util as-is (already a ratio).
   - Derives CPU ratio = cpu_used_cores / parse(requested.cpu).
   - Derives memory ratio = mem_used_mib / parse(requested.memory).
   - Skips samples without a declared request (no 0/0 averaging).
4. _estimate_over_provisioned_count compares gpu_util against 0.40
   (ratio) instead of 40.0 — the old threshold was always true and
   over-counted every multi-GPU job.
5. _build_analysis_prompt renders proper units to the LLM:
   GPU as percentage (ratio*100%), CPU as "used/requested cores",
   memory as "used/requested MiB" — no more misleading "%" suffixes
   on raw cores/MiB values.

Verified with realistic fixtures: a job using 7.8/8 cores, 36189/32768
MiB, 0.2 GPU ratio now reports 97.5% / ~110% / 20% respectively, and
the over-provisioned detector flags it (was over-counting before).
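
To make the unit handling concrete, here is a simplified sketch of the parsing and ratio steps, checked against the fixture quoted above. It is not the code from this commit: the function names mirror the commit message without the leading underscore, `usage_ratio` and `format_gpu_percent` are hypothetical, and the suffix tables cover only the suffixes the message lists.

```python
from typing import Optional

_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15}
_MIB = 2**20


def parse_cpu_cores(quantity: str) -> Optional[float]:
    """Parse a CPU quantity such as "8" or "500m" into cores."""
    q = quantity.strip()
    try:
        if q.endswith("m"):
            return float(q[:-1]) / 1000.0  # millicores
        return float(q)
    except ValueError:
        return None


def parse_memory_mib(quantity: str) -> Optional[float]:
    """Parse a memory quantity such as "32Gi", "512M" or raw bytes into MiB."""
    q = quantity.strip()
    try:
        for table in (_BINARY, _DECIMAL):  # binary first: "Mi" before "M"
            for suffix, factor in table.items():
                if q.endswith(suffix):
                    return float(q[: -len(suffix)]) * factor / _MIB
        return float(q) / _MIB  # no suffix: raw bytes
    except ValueError:
        return None


def normalize_gpu_util_ratio(value: float) -> float:
    """Collapse ratio (0-1) and legacy percent (0-100) inputs to a 0..1 ratio.

    Assumption: anything above 1.0 is a legacy percentage; a legacy value
    below 1% is indistinguishable from a ratio and is read as a ratio.
    """
    ratio = value / 100.0 if value > 1.0 else value
    return min(max(ratio, 0.0), 1.0)


def usage_ratio(used: float, requested: Optional[float]) -> Optional[float]:
    """used / requested, or None when no request was declared."""
    if not requested:
        return None  # skip the sample instead of averaging 0/0
    return used / requested


def format_gpu_percent(ratio: float) -> str:
    """Render a 0..1 ratio as a percentage string."""
    return "%.1f%%" % (ratio * 100.0)


cpu = usage_ratio(7.8, parse_cpu_cores("8"))
mem = usage_ratio(36189.0, parse_memory_mib("32Gi"))
gpu = format_gpu_percent(normalize_gpu_util_ratio(0.2))
```

With the fixture values, `cpu` is 0.975, `mem` is about 1.10, and `gpu` is "20.0%", matching the 97.5% / ~110% / 20% figures reported above.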

Related Go-side bugs (out of scope for this commit):
- tools_readonly.go:1317 estimateActualGPUUsage divides gpu_util by 100
  a second time, leaving gpu_actual_used near 0 for most jobs.
- tools_readonly.go:1555-1557, :2174, :2200 format 0-1 GPU ratios with
  "%.1f%%" producing "0.2%" instead of "20.0%".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cover docs-figure/ch3/fig3-3-mops-layered-architecture.drawio (and
its png export) so the un-tracked file no longer shows up in
git status.

Also note: 5 tracked drawio files (ch1 figures, ch2 cloud-native
stack, ch3 mas-roles, ch5 dataset-distribution) have local edits
marked as skip-worktree to silence them locally — they're personal
thesis work, not part of the agent codebase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	backend/cmd/gorm-gen/models/migrate.go
#	frontend/src/components/layout/detail-page.tsx
#	frontend/src/components/layout/page-title.tsx
#	frontend/src/i18n/locales/enUS/translation.json
#	frontend/src/i18n/locales/ja/translation.json
#	frontend/src/i18n/locales/ko/translation.json
Copilot AI review requested due to automatic review settings June 7, 2026 02:32

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

2 participants