
CommBench: can LLMs write correct and efficient GPU communication code?
- A benchmark for multi-device GPU communication code. 100+ independently runnable examples spanning P2P, collective, expert-parallel (EP), compute–communication fusion, and utilities, each labeled Easy / Medium / Hard.
- Curated by GPU communication experts. Every example is hand-written by GPU communication experts or expert-extracted from production codebases including Mscclpp, NCCL, NVSHMEM, DeepEP, ThunderKittens, vLLM, and SGLang.
- Cheat-resistant evaluation harness. Randomized test inputs, edit-region verification, and hidden build scripts prevent hard-coding and library substitution.
- Correctness and performance. Generated code is compiled, run, and compared against a hand-written reference on both correctness and speed, with optional multi-round refinement from compile/run feedback.
Today's frontier LLMs write excellent single-device code yet consistently fail on multi-device GPU communication, precisely the code that bottlenecks large-scale LLM training and inference. CommBench measures, and aims to help close, that gap.
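The README does not say how edit-region verification is implemented. One plausible approach, sketched here purely as an illustration, treats every non-`// TODO` line of the `empty_*` template as fixed and requires the generated file to reproduce those lines verbatim and in order, so changes can only appear where a `// TODO` used to be. The function name `edits_stay_in_todo_regions` and the line-based matching rule are assumptions, not CommBench's actual harness.

```python
import re

TODO_MARKER = "// TODO"  # marker the README says replaces stripped core logic


def edits_stay_in_todo_regions(template: str, generated: str) -> bool:
    """Return True if `generated` matches `template` outside the TODO regions.

    Hypothetical check: each run of TODO lines may expand to any number of
    lines, while every other template line must appear unchanged and in order.
    """
    parts = []
    previous_was_todo = False
    for line in template.splitlines():
        if TODO_MARKER in line:
            if not previous_was_todo:
                # Lazily match zero or more whole lines of filled-in code.
                parts.append(r"(?:.*\n)*?")
            previous_was_todo = True
        else:
            parts.append(re.escape(line) + r"\n")
            previous_was_todo = False
    # Normalize line endings so every line ends with exactly one "\n".
    normalized = "".join(line + "\n" for line in generated.splitlines())
    return re.fullmatch("".join(parts), normalized) is not None


template = "int main() {\n  // TODO\n  return 0;\n}\n"
inside = "int main() {\n  cudaSetDevice(0);\n  return 0;\n}\n"
outside = "#include <nccl.h>\n" + inside  # adds a line outside the TODO region

print(edits_stay_in_todo_regions(template, inside))
print(edits_stay_in_todo_regions(template, outside))
```

A real harness would likely also need to handle whitespace-only differences and very large files; this sketch ignores both.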

    empty_*.cu/cpp  LLM (GPT, Gemini, Claude, Grok, GLM, Qwen …) generated_*.cu/cpp
    ┌──────────────┐              ┌──────────┐                   ┌──────────────────┐
    │ // TODO      │ ── prompt ──▶│  Model   │── code response──▶│  filled-in code  │
    │ // TODO      │              └──────────┘                   └──────────────────┘
    └──────────────┘                                                      │
                                                                          ▼
                                                                build & run & compare
                                                                          │
                                                                          ▼
                                                                summary.json + plots

- Prompt: reads the `empty_*` template (reference code with core logic stripped to `// TODO`) and builds a completion prompt.
- Generate: sends the prompt to an LLM. With `--max-rounds > 1`, compile/run errors are fed back for self-correction.
- Evaluate: compiles and runs both reference and generated code, compares correctness and performance (latency, throughput), and produces CSV, plots, and a `summary.json`.

Supported models: OpenAI GPT, Google Gemini, Anthropic Claude, Grok, DeepSeek, GLM, and Qwen.
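The Generate and Evaluate steps with `--max-rounds > 1` amount to a feedback loop. The sketch below shows one way such a loop could look; it is not the code in `generate_eval_one.py`. The names `generate_with_feedback`, `EvalResult`, and the two callables are hypothetical stand-ins for the LLM call and the build-and-run step.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class EvalResult:
    passed: bool
    feedback: str = ""  # compiler or runtime error text, empty on success


def generate_with_feedback(
    prompt: str,
    generate: Callable[[str], str],         # hypothetical: prompt -> code
    evaluate: Callable[[str], EvalResult],  # hypothetical: code -> result
    max_rounds: int = 1,
) -> Tuple[Optional[str], int]:
    """Return (passing code or None, number of rounds used)."""
    current_prompt = prompt
    for round_idx in range(1, max_rounds + 1):
        code = generate(current_prompt)
        result = evaluate(code)
        if result.passed:
            return code, round_idx
        # Feed the failed attempt and its errors back for self-correction.
        current_prompt = (
            f"{prompt}\n\nPrevious attempt:\n{code}\n\nErrors:\n{result.feedback}"
        )
    return None, max_rounds


# Stub model: only produces working code once it has seen error feedback.
def fake_generate(p: str) -> str:
    return "good" if "Errors:" in p else "bad"


def fake_evaluate(code: str) -> EvalResult:
    if code == "good":
        return EvalResult(True)
    return EvalResult(False, "error: undefined symbol")


print(generate_with_feedback("fill the TODOs", fake_generate, fake_evaluate, 3))
```

With the stubs above, one round fails and a second round succeeds, which mirrors the default `--max-rounds 1` versus a multi-round run.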
| Category | What it covers |
|---|---|
| P2P | Point-to-point transfer between a pair of devices |
| Collective | Group operations across all ranks (AllReduce, AllGather, All-to-All, …) |
| EP | Dynamic, non-uniform dispatch/combine traffic for MoE models |
| Fusion | Kernels that interleave communication with compute (e.g., AllGather+GEMM) |
| Utilities | Connection setup, buffer registration, topology queries, GPU–CPU FIFO queues |
By source, examples span cuda-runtime, ibverbs, Mscclpp, NCCL, NVSHMEM, DeepEP, the NCCL device API, ThunderKittens, vLLM, and SGLang.
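For readers less familiar with the Collective row, the result each rank must hold after the three named operations can be written as a single-process reference model. This sketch describes semantics only: the function names are illustrative, plain Python lists stand in for per-rank device buffers, and nothing here reflects how the benchmark's examples or any library implement the transfers.

```python
from typing import List, Sequence


def all_reduce_sum(bufs: Sequence[Sequence[float]]) -> List[List[float]]:
    """bufs[r] is rank r's vector; every rank ends with the elementwise sum."""
    total = [sum(column) for column in zip(*bufs)]
    return [list(total) for _ in bufs]


def all_gather(bufs: Sequence[Sequence[float]]) -> List[List[float]]:
    """Every rank ends with all ranks' buffers concatenated in rank order."""
    gathered = [x for buf in bufs for x in buf]
    return [list(gathered) for _ in bufs]


def all_to_all(chunks: Sequence[Sequence[object]]) -> List[List[object]]:
    """chunks[src][dst] is what src sends to dst; rank dst receives one
    chunk from every src, ordered by source rank (a transpose)."""
    n = len(chunks)
    return [[chunks[src][dst] for src in range(n)] for dst in range(n)]


print(all_reduce_sum([[1, 2], [3, 4]]))                # [[4, 6], [4, 6]]
print(all_gather([[1], [2]]))                          # [[1, 2], [1, 2]]
print(all_to_all([["a0", "a1"], ["b0", "b1"]]))        # [['a0', 'b0'], ['a1', 'b1']]
```

A model like this is handy as a correctness oracle: the output of a multi-GPU kernel can be copied back to the host and compared against it.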
Models are sorted by Pass×GM ⭐, the pass rate scaled by geometric-mean code quality on passing examples. See the blog post for full metric definitions and case studies.
| Rank | Model | Pass×GM | Pass Rate | PASS+Good | GM‑Speedup | Open Source | Price |
|---|---|---|---|---|---|---|---|
| 🥇 | gpt-5.5 | 0.467 | 57.4% | 30.7% | 0.813 | ❌ | $1.91 |
| 🥈 | gemini-3.1-pro-preview | 0.305 | 36.6% | 25.7% | 0.832 | ❌ | $0.26 |
| 🥉 | claude-opus-4-7 | 0.282 | 33.7% | 20.8% | 0.836 | ❌ | $0.21 |
| 4️⃣ | glm-5.1 | 0.281 | 29.7% | 17.8% | 0.947 | ✅ | $0.63 |
| 5️⃣ | kimi-k2.6 | 0.275 | 30.7% | 18.8% | 0.895 | ✅ | $0.10 |
| 6️⃣ | qwen3.7-max | 0.269 | 26.7% | 15.8% | 1.008 | ❌ | $0.03 |
| 7️⃣ | deepseek-v4-pro | 0.197 | 19.8% | 12.9% | 0.995 | ✅ | $0.02 |
| Metric | Formula | What it measures |
|---|---|---|
| Pass×GM ⭐ | Pass Rate × GM‑Speedup | Pass rate scaled by geometric-mean code quality on passing examples. Primary ranking metric. |
| Pass Rate | PASS / Total | Fraction of examples where code compiled, ran, and produced correct results. |
| PASS+Good | (on_compare + better) / Total | Fraction of all examples whose code is both correct and performant, i.e. no more than 5% slower than the reference. |
| GM‑Speedup | GM of per-example speedup scores | Geometric mean of generated-vs-reference performance ratios over passing examples, taken across measured data sizes so each data point contributes equally. Because failing examples are excluded, a model that passes more (often harder) examples with mediocre performance can score lower than one that passes only a few easy examples; we therefore rank by Pass×GM. |
| Price | — | Average cost per example (USD). |
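The metric definitions above can be made concrete with a small aggregation sketch. It rests on assumptions the table does not spell out: a speedup is taken to be reference time divided by generated time, "within −5% of reference" is read as an example-level geometric-mean speedup of at least 0.95, and the input format and the name `summarize` are invented for illustration. The benchmark's own scoring may differ, for example by capping speedups.

```python
import math
from typing import Dict, List, Optional

# Assumption: "within -5% of reference" means example-level speedup >= 0.95.
GOOD_THRESHOLD = 0.95


def geometric_mean(values: List[float]) -> float:
    """Geometric mean of positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def summarize(results: Dict[str, Optional[List[float]]]) -> Dict[str, float]:
    """Aggregate per-example results into the leaderboard metrics.

    `results` maps an example name to its per-data-size speedups
    (reference time / generated time) if it passed, or None if it failed.
    """
    total = len(results)
    passing = [speedups for speedups in results.values() if speedups]
    pass_rate = len(passing) / total
    # Pool every data point of every passing example so each counts equally.
    points = [s for speedups in passing for s in speedups]
    gm_speedup = geometric_mean(points) if points else 0.0
    good = sum(
        1 for speedups in passing if geometric_mean(speedups) >= GOOD_THRESHOLD
    )
    return {
        "pass_rate": pass_rate,
        "pass_good": good / total,
        "gm_speedup": gm_speedup,
        "pass_x_gm": pass_rate * gm_speedup,
    }


# Toy input: two passing examples (one on par, one 2x slower), two failures.
metrics = summarize({"a": [1.0, 1.0], "b": [0.5, 0.5], "c": None, "d": None})
print(metrics)
```

As a sanity check against the leaderboard, gpt-5.5's pass rate of 57.4% multiplied by its GM‑Speedup of 0.813 gives 0.467, which matches its listed Pass×GM.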
A detailed comparison of the highest- and lowest-scoring models, gpt-5.5 (Pass×GM = 0.467) and deepseek-v4-pro (Pass×GM = 0.197).
[Figures: three side-by-side comparison panels, each showing GPT-5.5 (left) and DeepSeek-V4-Pro (right).]
Key findings:
- Even the strongest model, gpt-5.5, passes only 57.4% of examples and produces performant code on just 30.7%.
- Every model collapses to near-zero coverage on specialized libraries such as Mscclpp, ThunderKittens, and the NCCL device API. Models hallucinate APIs, misplace synchronization, and ship kernels orders of magnitude slower than the reference.
- Multi-round self-correction helps only on commodity libraries and easier tasks. Giving deepseek-v4-pro 5 rounds raises its pass rate from 15.8% to 41.6%, but unlocks neither Hard examples nor specialized libraries.

    # 1. Install dependencies
    pip install -r requirements.txt

    # 2. Set at least one API key
    export OPENAI_API_KEY="..."      # → --model gpt-4o
    export GOOGLE_API_KEY="..."      # → --model gemini-3-pro-preview
    export ANTHROPIC_API_KEY="..."   # → --model claude-sonnet-4-5-20250929

    # 3. Verify GPU compiler
    nvcc --version    # NVIDIA
    hipcc --version   # AMD

    # 4. Run an evaluation
    python scripts/generate_eval_one.py example001_gpu_comm_single_process \
        --model gpt-4o \
        --max-rounds 3

This reads the `empty_*` template, sends it to the model, compiles and runs the generated code, compares it against the reference, and writes a `summary.json` with correctness and performance results.
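To evaluate several examples in one go, the command above can be driven from a short script. The sketch below is a convenience wrapper we made up for illustration (`build_command` and `run_all` are not part of the repository). It relies only on the `--model` and `--max-rounds` flags shown above and defaults to a dry run, so nothing is executed or billed until `dry_run=False` is passed.

```python
import subprocess
import sys
from typing import List, Sequence

SCRIPT = "scripts/generate_eval_one.py"  # path as shown in the quick start


def build_command(example: str, model: str, max_rounds: int = 1) -> List[str]:
    """Assemble one invocation using only flags documented in this README."""
    return [
        sys.executable, SCRIPT, example,
        "--model", model,
        "--max-rounds", str(max_rounds),
    ]


def run_all(
    examples: Sequence[str],
    model: str,
    max_rounds: int = 1,
    dry_run: bool = True,
) -> List[List[str]]:
    """Build, and optionally execute, one command per example."""
    commands = [build_command(name, model, max_rounds) for name in examples]
    if not dry_run:
        for command in commands:
            # check=False: a failing example should not abort the whole batch.
            subprocess.run(command, check=False)
    return commands


for command in run_all(["example001_gpu_comm_single_process"], "gpt-4o", 3):
    print(" ".join(command))
```

Run it from the repository root so the relative script path resolves.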

    python scripts/generate_eval_one.py <example_name> [options]

    Options:
      --model MODEL         LLM to use (default: gpt-4o)
      --max-rounds N        Max generation rounds with error feedback (default: 1)
      --temperature FLOAT   Sampling temperature (default: 0.3)
      --datasets-dir PATH   Base datasets directory (default: ./datasets)
      --no-save             Don't save generated code
      --quiet               Suppress detailed output

A single example's `build_and_run.py` can also be run directly:

    python build_and_run.py --source ref_gpu_p2p_comm.cpp                                # build & run reference
    python build_and_run.py --compare ref_gpu_p2p_comm.cpp generated_gpu_p2p_comm.cpp    # compare ref vs generated

To contribute a new example, please follow the requirements in the Dataset Instructions.
We thank Mibura and AMD for sponsoring the testbed for this benchmark.
MIT License





