Feature/crater agent#414
Open
MOONSakura0614 wants to merge 88 commits into
# Conflicts:
# backend/cmd/gorm-gen/models/migrate.go
# backend/docs/docs.go
# backend/docs/swagger.json
# backend/docs/swagger.yaml
k8s_get_events, k8s_describe_resource, k8s_get_pod_logs removed from ADMIN_ONLY_TOOL_NAMES. Ownership scoping added in follow-up tasks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
k8s_get_events, k8s_describe_resource, k8s_get_pod_logs now included in agentUserTools so non-admin users receive these tool schemas.
GET /api/agent/k8s-ownership?pod_name=X&user_id=Y checks whether the pod belongs to a job owned by the given user. Used by Python local executor to scope k8s read tools for non-admin users.
job.Name is the display name; job.JobName is the k8s resource name that
pod names are derived from (e.g. {jobName}-worker-0). Fix the ownership
check to use the correct field.
k8s_get_events, k8s_describe_resource, k8s_get_pod_logs now verify pod ownership via Go backend /api/agent/k8s-ownership before executing kubectl for non-admin users. Admin users bypass the check.
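The gate described above can be sketched roughly as follows. This is a minimal illustration, not the executor's actual code: `run_k8s_read_tool` and the two injected callables are hypothetical names. `check_ownership` stands in for a GET to /api/agent/k8s-ownership (assumed to yield an allowed/denied answer), and `execute` stands in for the kubectl call.

```python
from typing import Callable

# Tool names are taken from the commit message; everything else is illustrative.
OWNERSHIP_SCOPED_TOOLS = {
    "k8s_get_events",
    "k8s_describe_resource",
    "k8s_get_pod_logs",
}


def run_k8s_read_tool(
    tool_name: str,
    pod_name: str,
    user_id: int,
    is_admin: bool,
    check_ownership: Callable[[str, int], bool],
    execute: Callable[[str, str], str],
) -> str:
    """Run a read-only k8s tool, gating non-admin callers on pod ownership.

    check_ownership is a hypothetical wrapper around the backend ownership
    endpoint; execute is a hypothetical wrapper around the kubectl call.
    """
    if tool_name in OWNERSHIP_SCOPED_TOOLS and not is_admin:
        # Non-admin users must own the job that the pod belongs to.
        if not check_ownership(pod_name, user_id):
            raise PermissionError(
                f"pod {pod_name!r} is not owned by user {user_id}"
            )
    # Admin users skip the check and go straight to execution.
    return execute(tool_name, pod_name)
```

Injecting the two callables keeps the gate testable without a cluster or a running backend; a real executor would presumably bind them to an HTTP client and a subprocess call.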
Tool schemas are delivered to the LLM via the bind_tools() API; listing all enabled_tools by name in the prompt text was noisy and had no effect on tool selection. Confirm-only tools remain listed as a behavioral hint. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Explicitly return 400 for user_id=0 to surface misconfiguration instead of silently returning allowed=false from a no-match DB query.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…leMetadata
Fixes the HTTP 500 that fired whenever the detail page tried to search by
session UUID and the several downstream UI issues reported:
- agent_audit.go: cast s.session_id to text in the ILIKE branch of the
keyword base query. Searching with a UUID string used to crash with
"operator does not exist: uuid ~~* unknown (SQLSTATE 42883)".
- Add GetSessionDetail(sessionID) service method + GET /admin/agent/
sessions/:sessionId/detail handler. Returns the same enriched
AgentAuditSessionListItem the list query builds, so the detail page
can fetch one session by ID instead of keyword-searching through a
paginated list. Route wired in RegisterAdmin.
- Aggregate orchestration modes per session via STRING_AGG(DISTINCT
orchestration_mode) over agent_turns, split to []string in Go.
AgentAuditSessionListItem gains OrchestrationModes field so the
detail page can show both single_agent and multi_agent when a
conversation used both across turns. last_orchestration_mode is kept
for compatibility.
- Refactor the list+detail query into buildAgentAuditSessionEnrichedQuery
so both code paths share the same SELECT and LEFT JOINs.
- vcjob/agent_submit.go: feed the agent submitter with the full
VolcanojobMgr service set (configService / queueQuotaSvc /
prequeueWatcher / billingService). Both SubmitJupyterJob and
SubmitTrainingJob now call req.validateScheduleOptions +
mgr.resolveJobScheduleMetadata and pass the resolved
*jobScheduleMetadata to getLabelAndAnnotations, matching the regular
/v1/vcjobs/* path so agent-created jobs carry the same prequeue /
backfill / tolerance annotations as user-created ones.
- Regenerate backend/docs/{docs.go,swagger.json,swagger.yaml} via
'make docs' so the swagger now reflects this branch's endpoints
(trigger-eval, quality-evals admin list, sessions/:id/detail) instead
of inheriting main's sloppy copy.
… left
Front-end follow-up to the backend uuid/detail/modes fixes.
- services/api/admin/agentAudit.ts:
* add apiAdminGetAgentAuditSessionDetail (GET /admin/agent/
sessions/:id/detail) so the detail page fetches a single session
directly instead of keyword-searching through the list API with
the full UUID (which hit the ILIKE-on-uuid bug).
* add AgentAuditSessionListItem.orchestrationModes so the detail
page can show every mode used across turns.
- routes/admin/more/agent-audit/$sessionId.tsx:
* switch the session-metadata query to the new detail endpoint.
* drop the top-left "Back to list" ("返回列表") button (users still have the nav
breadcrumb + browser back; admin consoles don't need a second one).
* render OrchestrationModes as a badge list (single_agent + multi_agent
both appear when a session mixed modes across turns).
* compute message / tool-call / turn counts from the actual fetched
arrays as a fallback when the enriched JOIN is stale or zero.
* treat the detail query's error as "not found" so non-chat sessions
that previously 500'd now fall back cleanly.
- routes/admin/more/agent-audit/index.tsx: drop the mx-auto from the
feedback column's thumbs-up/down icons so the icon sits left-aligned
in its table cell, matching where the fallback dash starts. Before,
icons were centered but "-" was left-start and the two kinds of rows
visually jumped.
The AgentQualityEval table types turn_id as uuid, and PostgreSQL rejects
empty-string uuid literals ('22P02: invalid input syntax for type uuid:
""'). That made POST /admin/agent/sessions/:id/trigger-eval and the
feedback-driven path 500 whenever the caller did not provide a turn ID
(session-level eval from the admin audit panel, and thumbs-up/down on
the conversation as a whole).
CreateQualityEval now detects eval.TurnID == "" and uses tx.Omit("TurnID")
to exclude the column from the INSERT, letting the DB default (NULL)
apply. The GORM-returned *model.AgentQualityEval still has its ID
populated, so the downstream goroutine can still forward
{ eval_id, session_id, turn_id: "" } to crater-agent's
/eval/quality/session endpoint.
…rds drive filter
Detail page rebuild
-------------------
Replace the ad-hoc "big white card + tabs" layout with
@/components/layout/detail-page (the same DetailPage abstraction that
jobs/detail uses). The route now wires:
- header: PageTitle title + copyable session id + source badge
- info: 10 icon-label-value rows in the 2/4-col grid (owner, account,
orchestration modes, eval status, msgs/tools/turns counts with
colour-coded numbers, feedback direction, created/updated times via
TimeDistance)
- tabs: TabsList with underline/icon style - Conversation timeline (对话时间线) /
Quality evaluation (质量评测) / User feedback (用户反馈) / Tool calls (工具调用) /
Execution turns (执行轮次)
- currentTab/setCurrentTab: URL search param ?tab=... via detailLinkOptions
(matches jobs pattern; tab state survives refreshes)
Also adds validateSearch + loader({ params }) => ({ crumb: short id }) so
the app breadcrumb now shows agent-audit -> <short-id> and the top-left
path navigation replaces the ad-hoc "Back to list" ("返回列表") button that
was removed earlier.
List page overview ↔ filter bar wiring
--------------------------------------
The 4 source summary cards at the top are now fully interactive and are
the canonical source filter; source is dropped from the toolbar's
faceted filters to avoid duplication. Click a card to filter the
DataTable below to that source; click again to clear. Each card gains
an accent colour (blue/amber/emerald/fuchsia) tied to the source,
plus a coloured icon chip, so they read as filter buttons rather than
static stats.
Implementation-wise, two useQuery hooks share the same queryKey: one
uses select to derive the summary, the other uses select to derive the
items filtered by source. TanStack Query dedupes the fetch, and React
re-renders only the items path when sourceFilter changes.
Number column colour
--------------------
msgs / tools / turns counts in the list (and the corresponding info
rows on the detail page) are now coloured sky / orange / violet so
the eye can parse them at a glance instead of blending into
everything else.
Add tool-loop stop-signal detection (k8s top, diagnose_job, capacity coverage), diagnostic evidence request matching, and follow-up prompt tightening across coordinator/planner/explorer/executor/verifier and intent router. Plan-execute orchestrator follows the same prompt profile split. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Relocates the AIOps (智能运维) navigation entry from the primary groups into the More (更多) menu on both admin and portal sidebars, and adds the missing crumb loader so the AIOps page now renders its breadcrumb correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- llm-clients.json: register glm-4.5 and qvq-max-2025-03-25 entries
- .gitignore: ignore local generated artifacts (output/, nested
crater-agent/crater-agent/, .drawio.bkp, .venv/, __pycache__/)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bring in offline thesis materials so the repo carries the same
references the agent work is being designed against:
- Chapter3.md, Chapter4.md: rewrite drafts in repo root
- docs/zh-CN/thesis-mops-ch3-ch4-rewrite.{md (already tracked),pdf}
- docs/zh/, docs/0511/, docs/thesis-v2/: prior single-chapter and
multi-version drafts kept for cross-reference
- crater-agent/docs0511/: per-chapter draft set used while reshaping
the agent service
- docs-figure/: drawio sources and PNG exports for ch1-ch4 figures
- mops_arch.drawio[.png], system_arch.drawio[.png],
mops-architecture.drawio: top-level architecture diagrams
Also extend .gitignore for drawio editor temp files (*.drawio.dtmp,
.$*.drawio.dtmp) that the editor leaves behind.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backend:
- New PUT /api/v1/agent/sessions/:sessionId/title endpoint
- AgentSessionTitleRequest type and UpdateSessionTitle handler with
owner check (mirrors UpdateSessionPin pattern)
- Service.UpdateSessionTitle now trims whitespace, rejects empty input,
truncates to 100 runes, and uses UpdateColumn so the session's
updated_at timestamp is NOT bumped on rename; renames do not count as
new activity for sort order.
Frontend:
- apiRenameSession(sessionId, title) in services/api/agent.ts
- AIChatDrawer session list item:
* New Edit (Pencil) icon next to Pin, only visible on group-hover
* Click → title cell turns into inline Input prefilled with current title
* Enter or blur commits via apiRenameSession; Esc cancels
* Loading spinner during save; toast on failure
* Empty / unchanged input drops back to view mode without API call
* 100-char client-side maxLength matches backend truncation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
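The title rule stated for Service.UpdateSessionTitle (trim, reject empty, truncate to 100 runes) can be illustrated with a short sketch. The backend is written in Go; this Python rendering only demonstrates the rule, and `normalize_session_title` and `MAX_TITLE_RUNES` are hypothetical names, not identifiers from the codebase.

```python
MAX_TITLE_RUNES = 100  # limit stated in the commit message


def normalize_session_title(raw: str) -> str:
    """Trim whitespace, reject empty input, cap the title at 100 code points."""
    title = raw.strip()
    if not title:
        raise ValueError("title must not be empty")
    # Python str slicing counts Unicode code points, which corresponds to
    # counting runes in Go, so multi-byte characters are never cut in half.
    return title[:MAX_TITLE_RUNES]
```

Counting code points rather than bytes matters for CJK titles: a 150-character Chinese title is cut to 100 characters, not to 100 bytes.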
Carry forward the .gitignore rules added on the demo branch: mindmap.drawio files, the zhisuan-platform-ops-challenges brainstorm diagrams, and several figure-source rewrites that stay in the working tree only — they're personal notes, not part of the agent codebase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The success-analysis `resource_efficiency` block was wrong on all three
dimensions because it treated every field as a 0-100 percentage, when in
fact the backend's Prometheus queries (backend/pkg/monitor/query.go)
emit different units:
- gpu_util_avg = 0.0-1.0 ratio (DCGM_FI_DEV_GPU_UTIL / 100 server-side)
- cpu_usage_avg = CPU cores absolute (irate of cpu_usage_seconds_total)
- mem_usage_avg = MiB absolute (container_memory_usage_bytes / 1Mi)
Previously each value was divided by 100 again, producing:
- GPU utilization (GPU 利用率) 0.2% (real: 20%)
- Average memory utilization (平均内存利用率) 36189.3% (36189 MiB / 100;
real: ~110% of a 32Gi request)
- CPU 7.8% looked plausible only by coincidence (7.8 cores out of 8)
Fixes:
1. New _parse_cpu_cores / _parse_memory_mib helpers parse Kubernetes
resource.Quantity strings (Ki/Mi/Gi/Ti binary + K/M/G/T/P decimal +
500m for CPU + raw bytes).
2. New _normalize_gpu_util_ratio collapses both ratio (0-1) and legacy
percent (0-100) inputs to a clamped 0..1 ratio.
3. _build_success_analysis now:
- Uses GPU util as-is (already a ratio).
- Derives CPU ratio = cpu_used_cores / parse(requested.cpu).
- Derives memory ratio = mem_used_mib / parse(requested.memory).
- Skips samples without a declared request (no 0/0 averaging).
4. _estimate_over_provisioned_count compares gpu_util against 0.40
(ratio) instead of 40.0; the old threshold was always true and
over-counted every multi-GPU job.
5. _build_analysis_prompt renders proper units to the LLM: GPU as
percentage (ratio*100%), CPU as "used/requested cores", memory as
"used/requested MiB", with no more misleading "%" suffixes on raw
cores/MiB values.
Verified with realistic fixtures: a job using 7.8/8 cores, 36189/32768
MiB, 0.2 GPU ratio now reports 97.5% / ~110% / 20% respectively, and the
over-provisioned detector flags it (was over-counting before).
Related Go-side bugs (out of scope for this commit):
- tools_readonly.go:1317 estimateActualGPUUsage divides gpu_util by 100
a second time, leaving gpu_actual_used near 0 for most jobs.
- tools_readonly.go:1555-1557, :2174, :2200 format 0-1 GPU ratios with
"%.1f%%" producing "0.2%" instead of "20.0%".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
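The unit handling described in this commit can be sketched as below. The function names echo the helpers named in the message, but the bodies are an independent illustration written under assumptions, not the code from the PR: `usage_ratio` is a hypothetical helper, the lowercase `k` suffix is accepted as an assumption, and exponent notation is deliberately left out.

```python
import re
from typing import Optional, Tuple

# Binary (power-of-two) and decimal suffixes for memory quantities.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15}
_QUANTITY = re.compile(r"\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]*)\s*")


def _split(quantity: str) -> Tuple[float, str]:
    """Split a quantity string such as '32Gi' into (32.0, 'Gi')."""
    match = _QUANTITY.fullmatch(quantity)
    if match is None:
        raise ValueError(f"unparseable quantity: {quantity!r}")
    return float(match.group(1)), match.group(2)


def parse_cpu_cores(quantity: str) -> float:
    """Convert a CPU quantity to cores: '500m' -> 0.5, '8' -> 8.0."""
    value, suffix = _split(quantity)
    if suffix == "m":  # millicores
        return value / 1000.0
    if suffix == "":
        return value
    raise ValueError(f"unsupported CPU suffix: {suffix!r}")


def parse_memory_mib(quantity: str) -> float:
    """Convert a memory quantity (suffix or raw bytes) to MiB."""
    value, suffix = _split(quantity)
    if suffix in _BINARY:
        factor = _BINARY[suffix]
    elif suffix in _DECIMAL:
        factor = _DECIMAL[suffix]
    elif suffix == "":
        factor = 1  # raw bytes
    else:
        raise ValueError(f"unsupported memory suffix: {suffix!r}")
    return value * factor / 2**20


def normalize_gpu_util_ratio(value: float) -> float:
    """Collapse ratio (0-1) or legacy percent (0-100) input to a 0..1 ratio."""
    ratio = value / 100.0 if value > 1.0 else value
    return min(max(ratio, 0.0), 1.0)


def usage_ratio(used: float, requested: Optional[float]) -> Optional[float]:
    """Return used/requested, or None when no request was declared."""
    if not requested:
        return None  # skip the sample instead of averaging a 0/0
    return used / requested
```

With the fixture from the message, `usage_ratio(7.8, parse_cpu_cores("8"))` gives 0.975 and `usage_ratio(36189, parse_memory_mib("32Gi"))` gives roughly 1.104, matching the reported 97.5% and ~110%. One limitation of the percent/ratio heuristic is that a legacy percent value at or below 1 is indistinguishable from a ratio and is treated as a ratio.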
Cover docs-figure/ch3/fig3-3-mops-layered-architecture.drawio (and its png export) so the un-tracked file no longer shows up in git status. Also note: 5 tracked drawio files (ch1 figures, ch2 cloud-native stack, ch3 mas-roles, ch5 dataset-distribution) have local edits marked as skip-worktree to silence them locally — they're personal thesis work, not part of the agent codebase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
# backend/cmd/gorm-gen/models/migrate.go
# frontend/src/components/layout/detail-page.tsx
# frontend/src/components/layout/page-title.tsx
# frontend/src/i18n/locales/enUS/translation.json
# frontend/src/i18n/locales/ja/translation.json
# frontend/src/i18n/locales/ko/translation.json
No description provided.