Add Claude Code automation layer for MAD (skills, agents, workflows) #160
Open
coketaste wants to merge 17 commits into
Add a foundation pack so common MAD tasks (benchmarking, adding models, tuning, development) can be driven through Claude Code with the repo's conventions baked in:

- CLAUDE.md: models.json schema, 4-step add-model flow, the "performance: <value> <unit>" stdout contract, madengine v2.1.0 CLI commands, deployment inference, and profiling.
- .claude/agents/: mad-model-author, mad-perf-analyst, mad-benchmark-runner, mad-tuner.
- .claude/commands/: mad-add-model, mad-benchmark, mad-profile, mad-report, mad-tune.
- .claude/workflows/: mad-benchmark-sweep and mad-tune-search dynamic workflows (plan-only by default; execute:true on a GPU host).
- .claude/settings.json: shared read-only/common command allowlist.

Commands and docs verified against the installed madengine Typer CLI v2.1.0 (@main): top-level build/run/discover/report/database.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
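As a rough illustration of the "performance: <value> <unit>" stdout contract mentioned above, the following Python sketch extracts the metric from captured output. The regular expression, the choice to keep the last matching line, and the unit name used in the usage note are assumptions made for illustration; they are not taken from madengine itself.

```python
import re
from typing import Optional, Tuple

# Assumed shape of the contract line: "performance: <value> <unit>".
# The exact grammar madengine accepts is not shown in this PR.
PERF_LINE = re.compile(
    r"^performance:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s+(\S+)\s*$"
)


def parse_performance(stdout: str) -> Optional[Tuple[float, str]]:
    """Return (value, unit) from the last matching line, or None if absent."""
    result = None
    for line in stdout.splitlines():
        match = PERF_LINE.match(line.strip())
        if match:
            # Keeping the last match is an assumption of this sketch.
            result = (float(match.group(1)), match.group(2))
    return result
```

For example, `parse_performance("warming up\nperformance: 422.0 tokens_per_sec\n")` yields the pair `(422.0, "tokens_per_sec")`, where `tokens_per_sec` is a hypothetical unit name.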
- mad-benchmark-sweep: drop the non-functional precision axis (precision is fixed per-model via training_precision or baked into the inference image, not a runtime flag) and give each cell its own perf_<cell>.csv to avoid clobbering a shared perf.csv under parallel execution; flag unresolved/errored cells.
- mad-model-author: document the multiple_results output contract alongside the performance: stdout line, and confirm new entries resolve via discover.
- Add /mad-validate: GPU-free static checker (JSON, paths, Dockerfile CONTEXT header, output contract) with errors vs convention-warning severities.
- Point profiling agent/command at scripts/common/tools.json as the source of truth and document the deploy-key convention.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
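The per-cell output isolation described above can be sketched as a small naming helper. This is a minimal Python sketch, not the workflow's actual code (the workflows are JavaScript); the function name `cell_output_name`, the `_g<n>` suffix, and the sanitizing rule are hypothetical.

```python
import re
from typing import Optional


def cell_output_name(model: str, n_gpus: Optional[int] = None) -> str:
    """Build a per-cell CSV name so parallel runs never share perf.csv."""
    # Hypothetical cell id: model name plus an optional GPU-count suffix.
    cell = model if n_gpus is None else f"{model}_g{n_gpus}"
    # Replace characters that are awkward in filenames with underscores.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", cell)
    return f"perf_{safe}.csv"
```

Under these assumptions, `cell_output_name("pyt_vllm_qwen3-8b", 8)` returns `perf_pyt_vllm_qwen3-8b_g8.csv`, so two cells of the same sweep write to different files.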
Pull request overview
Adds a Claude Code automation layer for MAD so common benchmarking, profiling, reporting, tuning, validation, and model-authoring tasks can be driven through slash commands, specialized agents, and two workflow scripts.
Changes:
- Adds Claude Code command prompts, agent definitions, workflow scripts, and permission settings under `.claude/`.
- Adds repository guidance in `CLAUDE.md` for MAD conventions and madengine usage.
- Adds a standalone HTML how-to guide documenting the automation layer.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `.claude/agents/mad-benchmark-runner.md` | Defines benchmark/profile command assembly and execution behavior. |
| `.claude/agents/mad-model-author.md` | Defines model scaffolding guidance for new MAD entries. |
| `.claude/agents/mad-perf-analyst.md` | Defines read-only benchmark result analysis behavior. |
| `.claude/agents/mad-tuner.md` | Defines iterative tuning behavior. |
| `.claude/commands/mad-add-model.md` | Adds slash command prompt for adding models. |
| `.claude/commands/mad-benchmark.md` | Adds slash command prompt for benchmark runs. |
| `.claude/commands/mad-profile.md` | Adds slash command prompt for profiled benchmark runs. |
| `.claude/commands/mad-report.md` | Adds slash command prompt for result analysis. |
| `.claude/commands/mad-tune.md` | Adds slash command prompt for tuning. |
| `.claude/commands/mad-validate.md` | Adds static validation command for MAD model entries. |
| `.claude/settings.json` | Adds Claude Code tool/command permission allow-list. |
| `.claude/workflows/mad-benchmark-sweep.js` | Adds parallel benchmark sweep workflow. |
| `.claude/workflows/mad-tune-search.js` | Adds candidate-based tuning search workflow. |
| `CLAUDE.md` | Adds Claude Code repository guidance for MAD. |
| `mad-automation-howto.html` | Adds user-facing HTML guide for commands, agents, workflows, and tips. |
…g & tuning

Brings Claude Code's multi-agent automation to MAD: plain slash commands now drive the full benchmark/tune lifecycle, with dynamic workflows that fan out specialized subagents and synthesize their results - turning manual, flag-heavy madengine runs into one-line, self-orchestrating operations.

Headline capabilities:

- mad-benchmark-sweep: a dynamic workflow that benchmarks many models in parallel and auto-builds a comparison table, with isolated per-cell output so parallel runs never clobber each other.
- mad-tune-search: an agentic, profiling-driven tuning loop. It profiles once to DIAGNOSE the real bottleneck (compute/memory/communication/launch), proposes evidence-backed candidates, measures each on clean runs, and has an independent agent adversarially verify every claimed gain before recommending a config - decisions grounded in data, not guesswork.

Engineering changes that make this work:

- Both workflows now accept the CLI-style flags the slash commands actually pass (--tags / --additional-context / --plan); the prior object-only parsing silently dropped tags and context and defaulted to plan-only. They thread --additional-context into every run and execute by default.
- mad-tune-search splits one context into a profiled variant (Diagnose) and a clean variant (measurement), evaluates candidates sequentially to avoid GPU contention and config-edit races, and isolates output to perf_tune_<id>.csv.
- mad-automation-howto.html updated to match: CLI-flag tables, execute-by-default callouts, the 5-phase tune-search, and a bottleneck-to-lever reference table.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
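The CLI-style flag handling described above can be sketched in Python as follows. The real workflows are JavaScript files and their parser is not shown in this PR, so this is only an illustration of the idea; `parse_workflow_args` is a hypothetical name, and treating `--tags` as a multi-value flag is an assumption.

```python
import argparse
import shlex


def parse_workflow_args(raw: str) -> argparse.Namespace:
    """Parse a CLI-style argument string as a slash command might pass it."""
    parser = argparse.ArgumentParser(prog="mad-benchmark-sweep", add_help=False)
    # Assumption: --tags accepts one or more model tags.
    parser.add_argument("--tags", nargs="+", default=[])
    # Passed through as an opaque JSON string.
    parser.add_argument("--additional-context", default="")
    # Execute by default; --plan switches to plan-only.
    parser.add_argument("--plan", action="store_true")
    # shlex.split honours shell-style quoting around the JSON context.
    return parser.parse_args(shlex.split(raw))
```

Parsing the raw string with `shlex.split` before handing it to `argparse` keeps a quoted JSON context such as `'{"n_gpus": 8}'` intact as a single value instead of dropping it.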
Add real results and operational lessons from the Qwen3-8B profiling and tuning session on MI350X:

- TunableOp warmup warning: first cold run collapses throughput to ~0.2-1 tok/s during online GEMM benchmarking; measurements are only valid on warm subsequent runs. Added to env var card and Tips section.
- New red callout in tune-search: Docker containers outlive the Claude session, so a slow candidate can stall the whole sequential search and leave a config edit (e.g. extended.yaml) unapplied. Always pass --timeout to bound each candidate run.
- New Tips entry "Qwen3-8B live tuning findings": rocm_trace_lite kernel breakdown (Cijk GEMM 27% + wvSplitK 21% = compute-bound), and the headline result -- max_concurrency=32 delivered 7993 tok/s throughput vs 422 tok/s baseline (18.9x), confirming concurrency as the primary vLLM serving lever.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
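The 18.9x figure follows directly from the two throughput numbers quoted above (7993 tok/s against a 422 tok/s baseline). A minimal Python check, with `speedup` as a hypothetical helper name:

```python
def speedup(candidate: float, baseline: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate / baseline


# 7993 / 422 is roughly 18.94, which rounds to the 18.9x reported above.
print(f"{speedup(7993, 422):.1f}x")  # prints 18.9x
```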
Correct the architecture SVG so all 6 slash commands, both workflows, and the inline /mad-validate path are represented accurately. Give /mad-profile its own box, route /mad-validate to an inline-script node, split the two workflows with distinct fan-out targets, and color-code arrows with a legend. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The Future section described these paths as existing scaffolding, but they are not in the repo. Reword to "planned" so Claude Code sessions do not assume the files/data are available. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The benchmark-runner agent claimed no dummy model exists, but models.json has dummy_multi (tag "dummies"). Point the agent at it as the lightweight smoke-test target. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Without these, benchmark/profile/tuning agents in execute mode require manual approval for every run command, blocking automation on GPU hosts. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…flows

Before any madengine invocation, agents now check whether madengine is on PATH. If missing and requirements.txt is present (i.e. we're in the MAD repo), they auto-install via pip. Otherwise they print clear install/clone instructions and halt. A secondary check warns when models.json is absent, catching "wrong directory" mistakes early.

Affected: mad-benchmark-runner, mad-tuner, mad-model-author agents; mad-benchmark, mad-profile, mad-tune, mad-add-model commands; mad-benchmark-sweep and mad-tune-search workflows.

Not changed: mad-perf-analyst, mad-report, mad-validate (read-only/GPU-free).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Replaces the 6 legacy .claude/commands/mad-*.md with proper Skills under
.claude/skills/, following the current Claude Code best practice where
commands and skills are unified and skills are the recommended form.
Changes:
- Add .claude/skills/mad-{benchmark,profile,tune,add-model}/SKILL.md with
disable-model-invocation:true (manual-only; these build/run on AMD GPUs)
- Add .claude/skills/mad-{report,validate}/SKILL.md as auto-invocable
(read-only, safe for Claude to trigger from plain-English intent)
- All 6 skills use context:fork to dispatch to their curated subagent,
collapsing the old command→subagent indirection into one file per skill
- Add .claude/skills/mad-common/preflight.sh — single copy of the madengine
install + repo-root check, injected via dynamic context (was duplicated 7x)
- Add .claude/skills/mad-validate/scripts/validate.py — extracts the
embedded Python heredoc into a real bundled file (tested: 142 models, 0 errors)
- Slim the 4 agents to pure system prompts (role + tools + standing rules);
per-invocation steps and pre-flight blocks move to the skills
- Delete all 6 .claude/commands/mad-*.md (skills supersede them with the
same /mad-* names; existing muscle memory unchanged)
- Update CLAUDE.md to reference the new skills layout
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
1. Preflight path: replace relative `bash .claude/skills/mad-common/preflight.sh`
with `bash ${CLAUDE_SKILL_DIR}/../mad-common/preflight.sh` in the 4 GPU skills so
the injection works regardless of cwd (not just when at the repo root).
Update allowed-tools from the specific path to `Bash(bash *)` to match.
2. $0 truncation: replace `$0` with `$ARGUMENTS` throughout mad-add-model — $0
only expands to the first whitespace-delimited token, silently truncating model
names if the user passes extra unquoted context words.
3. /mad-validate in forked agent: replace the `/mad-validate $ARGUMENTS` slash-
command call (which doesn't work inside a forked subagent) with an explicit
`python3 .claude/skills/mad-validate/scripts/validate.py "$ARGUMENTS"` invocation.
4. validate.py cwd: add git rev-parse + os.chdir(repo_root) so validate.py resolves
models.json correctly even when called from a subdirectory.
5. settings.json: add pre-approvals for `preflight.sh` and `validate.py` so a fresh
clone doesn't hit permission prompts on first use.
6. Workflow preflight dedup: replace the inline madengine-install shell block in
mad-benchmark-sweep.js and mad-tune-search.js with a reference to preflight.sh.
Also add the missing preflight check to the Diagnose and Evaluate phases in
mad-tune-search.js (previously only the Baseline phase had it).
7. Hardcoded path: remove `/home/ysha/MAD` absolute path from mad-benchmark-sweep.js.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
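Item 2 above (the `$0` truncation) comes down to the difference between the first whitespace-delimited token and the full argument string. The Python sketch below illustrates that difference only; it does not reproduce how Claude Code expands `$0` or `$ARGUMENTS`, and the sample argument string is hypothetical.

```python
from typing import Tuple


def split_arguments(arguments: str) -> Tuple[str, str]:
    """Return (first_token, full_string) for a raw argument string."""
    tokens = arguments.split()
    first_token = tokens[0] if tokens else ""
    # The full string keeps everything the user typed after the command.
    return first_token, arguments.strip()


first, full = split_arguments("pyt_vllm_qwen3-8b extra context words")
# `first` drops the trailing words; `full` preserves them.
```

Here `first` is `pyt_vllm_qwen3-8b` while `full` still carries the extra words, which is the information a `$0`-style expansion would silently lose.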
- Add Claude Code Integration section documenting the /mad-* skills, their invocation mode (manual vs auto), and the two workflows
- Add /mad-add-model callout in the Contributing > Adding New Models section so contributors know the automated path exists
- Fix factual error: timeout -1 (not 0) disables the timeout entirely, matching the madengine models.json schema documented in CLAUDE.md

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Migrate mad-* slash commands to Claude Code Skills
Comment on lines +5 to +23

    set -u

    if ! command -v madengine &>/dev/null; then
      if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
        echo "[pre-flight] madengine not found. Installing from requirements.txt..."
        pip install -r requirements.txt
      else
        echo "[pre-flight] madengine not found and requirements.txt is missing."
        echo "  Install: pip install git+https://github.com/ROCm/madengine.git@main"
        echo "  Or clone MAD and run from its root (which has requirements.txt)."
        exit 1
      fi
    fi

    if [ ! -f models.json ]; then
      echo "[pre-flight] Warning: models.json not found — run from the MAD repo root."
    fi

    echo "[pre-flight] OK: madengine=$(command -v madengine), cwd=$(pwd)"
Comment on lines +136 to +138

    const cleanCtxFlag = cleanCtx ? ` --additional-context '${cleanCtx}'` : ''
    const profCtxFlag = profiledCtx ? ` --additional-context '${profiledCtx}'`
      : ` --additional-context '{"tools": [{"name": "${profileToolName || 'rocm_trace_lite'}"}]}'`
Comment on lines +134 to +147

    let ctx = ''
    if (addlCtx) {
      let merged = addlCtx
      if (cell.nGpus) {
        try {
          const obj = JSON.parse(addlCtx)
          obj.n_gpus = cell.nGpus
          merged = JSON.stringify(obj)
        } catch (e) { /* leave raw; n_gpus axis ignored for this cell */ }
      }
      ctx = ` --additional-context '${merged}'`
    } else if (cell.nGpus) {
      ctx = ` --additional-context '{"n_gpus": "${cell.nGpus}"}'`
    }
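Two details of the snippet above are worth seeing side by side: the fallback branch always writes `n_gpus` as a quoted string while the merge branch writes whatever type `cell.nGpus` has, and wrapping the JSON in literal single quotes breaks if the context itself contains a single quote. The Python sketch below shows one way to build the flag with a single serialization path and `shlex.quote`. It is an illustration rather than a proposed patch to the JavaScript workflow; `build_context_flag` is a hypothetical name, and whether madengine expects `n_gpus` as an integer or a string is not established in this PR.

```python
import json
import shlex
from typing import Optional


def build_context_flag(additional_context: str, n_gpus: Optional[int]) -> str:
    """Build the --additional-context fragment for one sweep cell."""
    obj = {}
    if additional_context:
        try:
            obj = json.loads(additional_context)
        except json.JSONDecodeError:
            obj = None
        if not isinstance(obj, dict):
            # Mirror the snippet: pass the raw context through unchanged
            # and ignore the n_gpus axis for this cell.
            return f" --additional-context {shlex.quote(additional_context)}"
    if n_gpus is not None:
        # Assumption: emit n_gpus with one consistent type in both branches.
        obj["n_gpus"] = n_gpus
    if not obj:
        return ""
    # shlex.quote escapes embedded single quotes safely.
    return f" --additional-context {shlex.quote(json.dumps(obj))}"
```

With this shape, a cell that has only a GPU count and a cell that merges the count into an existing context both go through `json.dumps`, so the emitted type of `n_gpus` cannot drift between branches.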
Comment on lines +15 to +24

    # Resolve the repo root so the script works regardless of cwd.
    try:
        repo_root = subprocess.check_output(
            ["git", "rev-parse", "--show-toplevel"], text=True, stderr=subprocess.DEVNULL
        ).strip()
        os.chdir(repo_root)
    except Exception:
        pass  # fall back to cwd; works when already at repo root

    models = json.load(open("models.json"))
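For comparison, the same repo-root fallback can be written without changing the process working directory, with a narrower exception list and a context manager around the file handle. This is a sketch under assumptions, not the contents of validate.py: `find_repo_root` and `load_models` are hypothetical names, and models.json is assumed to hold a JSON array of entries.

```python
import json
import subprocess
from pathlib import Path


def find_repo_root(start: Path) -> Path:
    """Return the git top-level directory, or `start` if git cannot tell."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "--show-toplevel"],
            cwd=start,
            text=True,
            stderr=subprocess.DEVNULL,
        )
        return Path(out.strip())
    except (subprocess.CalledProcessError, OSError):
        # Not a git checkout, or git is not installed: fall back to `start`.
        return start


def load_models(root: Path) -> list:
    """Load models.json from the given root, closing the file afterwards."""
    with open(root / "models.json", encoding="utf-8") as handle:
        return json.load(handle)
```

A caller would combine the two as `load_models(find_repo_root(Path.cwd()))`, which keeps path resolution explicit instead of relying on `os.chdir`.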
Comment on lines +19 to +20

    The model name is the first token of `$ARGUMENTS` (e.g. `pyt_vllm_qwen3-8b`). Use
    `$ARGUMENTS` where the full model name is needed — do not split on spaces.

    ## Claude Code Integration

    MAD ships with a set of `/mad-*` skills for [Claude Code](https://claude.ai/code) that cover the four most common tasks. See `CLAUDE.md` for full context and conventions.