Skip to content

Add Claude Code automation layer for MAD (skills, agents, workflows)#160

Open
coketaste wants to merge 17 commits into
ROCm:developfrom
coketaste:coketaste/claude-workflow
Open

Add Claude Code automation layer for MAD (skills, agents, workflows)#160
coketaste wants to merge 17 commits into
ROCm:developfrom
coketaste:coketaste/claude-workflow

Conversation

@coketaste

Copy link
Copy Markdown
Contributor

Summary

Adds a Claude Code automation layer on top of `madengine` so common MAD
tasks — benchmarking, adding models, profiling, tuning, and analysis — can be
driven by slash commands and multi-agent workflows instead of hand-assembled
CLI invocations.

- **Slash commands + agents**: `/mad-benchmark`, `/mad-add-model`, `/mad-profile`,
  `/mad-report`, `/mad-tune`, `/mad-validate`, backed by dedicated subagents
  (benchmark-runner, model-author, perf-analyst, tuner).
- **Workflows**: `mad-benchmark-sweep` (fan out a benchmark matrix across tags,
  parallel, synthesize a comparison table) and `mad-tune-search` (profile-driven
  tuning search with adversarial verification).
- **Static validator** for `models.json` entries (JSON, paths, Dockerfile header,
  performance-line contract) — no GPU required.
- **How-to docs**: `mad-automation-howto.html`.

## Key design decisions

- **Workflows accept the same CLI-style flags as `/mad-benchmark`**
  (`--tags`, `--additional-context`, `--ngpus`, `--timeout`, `--plan`), and thread
  `--additional-context` (HF token, `docker_env_vars`) into every cell. They
  execute by default; `--plan`/`--no-execute` for a GPU-free dry run.
- **`mad-benchmark-sweep` runs cells in parallel** with per-cell `perf_<tag>.csv`
  isolation (no shared-file clobbering); precision is intentionally not a sweep
  axis (it's fixed per model in `models.json`).
- **`mad-tune-search` is profile-driven**: a Diagnose phase profiles the model
  once to classify the bottleneck (compute/memory/communication/launch) from
  kernel hotspots, proposals must cite that evidence, and candidates are then
  measured **cleanly and sequentially** (profiler stripped) so deltas aren't
  buried in tracing overhead or corrupted by GPU contention / parallel file edits.
- Both workflows handle `multiple_results` models (per-metric CSV instead of a
  stdout `performance:` line).

## Test plan

- [ ] `madengine discover --tags <tag>` resolves for the documented examples
- [ ] `/mad-benchmark-sweep --tags ... --plan` validates without a GPU
- [ ] `/mad-benchmark-sweep --tags <tag> --additional-context '{...}'` executes
      on a GPU host and writes `perf_<tag>.csv`
- [ ] `/mad-tune-search --tag <model> --plan` produces a profiling-informed
      candidate list
- [ ] `/mad-tune-search --tag <model> --additional-context '{...}'` diagnoses
      the bottleneck and measures candidates cleanly
- [ ] `/mad-validate` passes for all `models.json` entries
- [ ] `mad-automation-howto.html` renders and matches current workflow behavior

coketaste and others added 3 commits May 31, 2026 12:51
Add a foundation pack so common MAD tasks (benchmarking, adding models,
tuning, development) can be driven through Claude Code with the repo's
conventions baked in:

- CLAUDE.md: models.json schema, 4-step add-model flow, the
  "performance: <value> <unit>" stdout contract, madengine v2.1.0 CLI
  commands, deployment inference, and profiling.
- .claude/agents/: mad-model-author, mad-perf-analyst, mad-benchmark-runner,
  mad-tuner.
- .claude/commands/: mad-add-model, mad-benchmark, mad-profile, mad-report,
  mad-tune.
- .claude/workflows/: mad-benchmark-sweep and mad-tune-search dynamic
  workflows (plan-only by default; execute:true on a GPU host).
- .claude/settings.json: shared read-only/common command allowlist.

Commands and docs verified against the installed madengine Typer CLI v2.1.0
(@main): top-level build/run/discover/report/database.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- mad-benchmark-sweep: drop the non-functional precision axis (precision is
  fixed per-model via training_precision or baked into the inference image, not
  a runtime flag) and give each cell its own perf_<cell>.csv to avoid clobbering
  a shared perf.csv under parallel execution; flag unresolved/errored cells.
- mad-model-author: document the multiple_results output contract alongside the
  performance: stdout line, and confirm new entries resolve via discover.
- Add /mad-validate: GPU-free static checker (JSON, paths, Dockerfile CONTEXT
  header, output contract) with errors vs convention-warning severities.
- Point profiling agent/command at scripts/common/tools.json as the source of
  truth and document the deploy-key convention.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@coketaste coketaste self-assigned this Jun 1, 2026
Copilot AI review requested due to automatic review settings June 1, 2026 02:43
@coketaste coketaste marked this pull request as draft June 1, 2026 02:43

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a Claude Code automation layer for MAD so common benchmarking, profiling, reporting, tuning, validation, and model-authoring tasks can be driven through slash commands, specialized agents, and two workflow scripts.

Changes:

  • Adds Claude Code command prompts, agent definitions, workflow scripts, and permission settings under .claude/.
  • Adds repository guidance in CLAUDE.md for MAD conventions and madengine usage.
  • Adds a standalone HTML how-to guide documenting the automation layer.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
.claude/agents/mad-benchmark-runner.md Defines benchmark/profile command assembly and execution behavior.
.claude/agents/mad-model-author.md Defines model scaffolding guidance for new MAD entries.
.claude/agents/mad-perf-analyst.md Defines read-only benchmark result analysis behavior.
.claude/agents/mad-tuner.md Defines iterative tuning behavior.
.claude/commands/mad-add-model.md Adds slash command prompt for adding models.
.claude/commands/mad-benchmark.md Adds slash command prompt for benchmark runs.
.claude/commands/mad-profile.md Adds slash command prompt for profiled benchmark runs.
.claude/commands/mad-report.md Adds slash command prompt for result analysis.
.claude/commands/mad-tune.md Adds slash command prompt for tuning.
.claude/commands/mad-validate.md Adds static validation command for MAD model entries.
.claude/settings.json Adds Claude Code tool/command permission allow-list.
.claude/workflows/mad-benchmark-sweep.js Adds parallel benchmark sweep workflow.
.claude/workflows/mad-tune-search.js Adds candidate-based tuning search workflow.
CLAUDE.md Adds Claude Code repository guidance for MAD.
mad-automation-howto.html Adds user-facing HTML guide for commands, agents, workflows, and tips.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .claude/workflows/mad-benchmark-sweep.js
Comment thread .claude/workflows/mad-benchmark-sweep.js Outdated
Comment thread .claude/workflows/mad-benchmark-sweep.js Outdated
Comment thread .claude/workflows/mad-tune-search.js
Comment thread .claude/workflows/mad-tune-search.js Outdated
Comment thread .claude/commands/mad-validate.md Outdated
Comment thread CLAUDE.md Outdated
Comment thread .claude/agents/mad-benchmark-runner.md Outdated
Comment thread .claude/settings.json
coketaste and others added 13 commits June 1, 2026 03:43
…g & tuning

Brings Claude Code's multi-agent automation to MAD: plain slash commands now
drive the full benchmark/tune lifecycle, with dynamic workflows that fan out
specialized subagents and synthesize their results - turning manual, flag-heavy
madengine runs into one-line, self-orchestrating operations.

Headline capabilities:
- mad-benchmark-sweep: a dynamic workflow that benchmarks many models in
  parallel and auto-builds a comparison table, with isolated per-cell output
  so parallel runs never clobber each other.
- mad-tune-search: an agentic, profiling-driven tuning loop. It profiles once
  to DIAGNOSE the real bottleneck (compute/memory/communication/launch),
  proposes evidence-backed candidates, measures each on clean runs, and has an
  independent agent adversarially verify every claimed gain before recommending
  a config - decisions grounded in data, not guesswork.

Engineering changes that make this work:
- Both workflows now accept the CLI-style flags the slash commands actually
  pass (--tags / --additional-context / --plan); the prior object-only parsing
  silently dropped tags and context and defaulted to plan-only. They thread
  --additional-context into every run and execute by default.
- mad-tune-search splits one context into a profiled variant (Diagnose) and a
  clean variant (measurement), evaluates candidates sequentially to avoid GPU
  contention and config-edit races, and isolates output to perf_tune_<id>.csv.
- mad-automation-howto.html updated to match: CLI-flag tables, execute-by-default
  callouts, the 5-phase tune-search, and a bottleneck-to-lever reference table.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Add real results and operational lessons from the Qwen3-8B profiling and
tuning session on MI350X:

- TunableOp warmup warning: first cold run collapses throughput to
  ~0.2-1 tok/s during online GEMM benchmarking; measurements are only
  valid on warm subsequent runs. Added to env var card and Tips section.
- New red callout in tune-search: Docker containers outlive the Claude
  session, so a slow candidate can stall the whole sequential search and
  leave a config edit (e.g. extended.yaml) unapplied. Always pass
  --timeout to bound each candidate run.
- New Tips entry "Qwen3-8B live tuning findings": rocm_trace_lite kernel
  breakdown (Cijk GEMM 27% + wvSplitK 21% = compute-bound), and the
  headline result -- max_concurrency=32 delivered 7993 tok/s throughput
  vs 422 tok/s baseline (18.9x), confirming concurrency as the primary
  vLLM serving lever.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Correct the architecture SVG so all 6 slash commands, both workflows, and
the inline /mad-validate path are represented accurately. Give /mad-profile
its own box, route /mad-validate to an inline-script node, split the two
workflows with distinct fan-out targets, and color-code arrows with a legend.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The Future section described these paths as existing scaffolding, but
they are not in the repo. Reword to "planned" so Claude Code sessions
do not assume the files/data are available.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The benchmark-runner agent claimed no dummy model exists, but
models.json has dummy_multi (tag "dummies"). Point the agent at it as
the lightweight smoke-test target.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Without these, benchmark/profile/tuning agents in execute mode require
manual approval for every run command, blocking automation on GPU hosts.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…flows

Before any madengine invocation, agents now check whether madengine is on
PATH. If missing and requirements.txt is present (i.e. we're in the MAD
repo), they auto-install via pip. Otherwise they print clear install/clone
instructions and halt. A secondary check warns when models.json is absent,
catching "wrong directory" mistakes early.

Affected: mad-benchmark-runner, mad-tuner, mad-model-author agents;
mad-benchmark, mad-profile, mad-tune, mad-add-model commands;
mad-benchmark-sweep and mad-tune-search workflows.
Not changed: mad-perf-analyst, mad-report, mad-validate (read-only/GPU-free).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Replaces the 6 legacy .claude/commands/mad-*.md with proper Skills under
.claude/skills/, following the current Claude Code best practice where
commands and skills are unified and skills are the recommended form.

Changes:
- Add .claude/skills/mad-{benchmark,profile,tune,add-model}/SKILL.md with
  disable-model-invocation:true (manual-only; these build/run on AMD GPUs)
- Add .claude/skills/mad-{report,validate}/SKILL.md as auto-invocable
  (read-only, safe for Claude to trigger from plain-English intent)
- All 6 skills use context:fork to dispatch to their curated subagent,
  collapsing the old command→subagent indirection into one file per skill
- Add .claude/skills/mad-common/preflight.sh — single copy of the madengine
  install + repo-root check, injected via dynamic context (was duplicated 7x)
- Add .claude/skills/mad-validate/scripts/validate.py — extracts the
  embedded Python heredoc into a real bundled file (tested: 142 models, 0 errors)
- Slim the 4 agents to pure system prompts (role + tools + standing rules);
  per-invocation steps and pre-flight blocks move to the skills
- Delete all 6 .claude/commands/mad-*.md (skills supersede them with the
  same /mad-* names; existing muscle memory unchanged)
- Update CLAUDE.md to reference the new skills layout

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
1. Preflight path: replace relative `bash .claude/skills/mad-common/preflight.sh`
   with `bash ${CLAUDE_SKILL_DIR}/../mad-common/preflight.sh` in the 4 GPU skills so
   the injection works regardless of cwd (not just when at the repo root).
   Update allowed-tools from the specific path to `Bash(bash *)` to match.

2. $0 truncation: replace `$0` with `$ARGUMENTS` throughout mad-add-model — $0
   only expands to the first whitespace-delimited token, silently truncating model
   names if the user passes extra unquoted context words.

3. /mad-validate in forked agent: replace the `/mad-validate $ARGUMENTS` slash-
   command call (which doesn't work inside a forked subagent) with an explicit
   `python3 .claude/skills/mad-validate/scripts/validate.py "$ARGUMENTS"` invocation.

4. validate.py cwd: add git rev-parse + os.chdir(repo_root) so validate.py resolves
   models.json correctly even when called from a subdirectory.

5. settings.json: add pre-approvals for `preflight.sh` and `validate.py` so a fresh
   clone doesn't hit permission prompts on first use.

6. Workflow preflight dedup: replace the inline madengine-install shell block in
   mad-benchmark-sweep.js and mad-tune-search.js with a reference to preflight.sh.
   Also add the missing preflight check to the Diagnose and Evaluate phases in
   mad-tune-search.js (previously only the Baseline phase had it).

7. Hardcoded path: remove `/home/ysha/MAD` absolute path from mad-benchmark-sweep.js.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- Add Claude Code Integration section documenting the /mad-* skills,
  their invocation mode (manual vs auto), and the two workflows
- Add /mad-add-model callout in the Contributing > Adding New Models
  section so contributors know the automated path exists
- Fix factual error: timeout -1 (not 0) disables the timeout entirely,
  matching the madengine models.json schema documented in CLAUDE.md

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Migrate mad-* slash commands to Claude Code Skills
@coketaste coketaste marked this pull request as ready for review June 9, 2026 21:21
Copilot AI review requested due to automatic review settings June 9, 2026 21:21
@coketaste coketaste changed the title Add Claude Code automation layer for MAD (commands, agents, workflows) Add Claude Code automation layer for MAD (skills, agents, workflows) Jun 9, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.

Comment on lines +5 to +23
set -u

if ! command -v madengine &>/dev/null; then
if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
echo "[pre-flight] madengine not found. Installing from requirements.txt..."
pip install -r requirements.txt
else
echo "[pre-flight] madengine not found and requirements.txt is missing."
echo " Install: pip install git+https://github.com/ROCm/madengine.git@main"
echo " Or clone MAD and run from its root (which has requirements.txt)."
exit 1
fi
fi

if [ ! -f models.json ]; then
echo "[pre-flight] Warning: models.json not found — run from the MAD repo root."
fi

echo "[pre-flight] OK: madengine=$(command -v madengine), cwd=$(pwd)"
Comment on lines +136 to +138
const cleanCtxFlag = cleanCtx ? ` --additional-context '${cleanCtx}'` : ''
const profCtxFlag = profiledCtx ? ` --additional-context '${profiledCtx}'`
: ` --additional-context '{"tools": [{"name": "${profileToolName || 'rocm_trace_lite'}"}]}'`
Comment on lines +134 to +147
let ctx = ''
if (addlCtx) {
let merged = addlCtx
if (cell.nGpus) {
try {
const obj = JSON.parse(addlCtx)
obj.n_gpus = cell.nGpus
merged = JSON.stringify(obj)
} catch (e) { /* leave raw; n_gpus axis ignored for this cell */ }
}
ctx = ` --additional-context '${merged}'`
} else if (cell.nGpus) {
ctx = ` --additional-context '{"n_gpus": "${cell.nGpus}"}'`
}
Comment on lines +15 to +24
# Resolve the repo root so the script works regardless of cwd.
try:
repo_root = subprocess.check_output(
["git", "rev-parse", "--show-toplevel"], text=True, stderr=subprocess.DEVNULL
).strip()
os.chdir(repo_root)
except Exception:
pass # fall back to cwd; works when already at repo root

models = json.load(open("models.json"))
Comment on lines +19 to +20
The model name is the first token of `$ARGUMENTS` (e.g. `pyt_vllm_qwen3-8b`). Use
`$ARGUMENTS` where the full model name is needed — do not split on spaces.
Comment thread README.md

## Claude Code Integration

MAD ships with a set of `/mad-*` skills for [Claude Code](https://claude.ai/code) that cover the four most common tasks. See `CLAUDE.md` for full context and conventions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants