GitHub - uccl-project/CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?

CommBench: can LLMs write correct and efficient GPU communication code?

Highlights

A benchmark for multi-device GPU communication code. 100+ independently runnable examples spanning P2P, collective, expert-parallel (EP), compute–communication fusion, and utilities, each labeled Easy / Medium / Hard.
Curated by GPU communication experts. Every example is hand-written by GPU communication experts or expert-extracted from production codebases including Mscclpp, NCCL, NVSHMEM, DeepEP, ThunderKittens, vLLM, and SGLang.
Cheat-resistant evaluation harness. Randomized test inputs, edit-region verification, and hidden build scripts prevent hard-coding and library substitution.
Correctness and performance. Generated code is compiled, run, and compared against a hand-written reference on both correctness and speed, with optional multi-round refinement from compile/run feedback.
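
To make the idea of edit-region verification concrete, here is a minimal sketch of one way such a check could work. It is not CommBench's actual harness: the function name `edits_stay_in_todo_regions` and the line-based matching rule are assumptions made for illustration. The check accepts a generated file only if every non-TODO line of the template survives, in order, and new lines appear only where the template had a `// TODO` marker.

```python
def edits_stay_in_todo_regions(template: str, generated: str,
                               marker: str = "// TODO") -> bool:
    """Hypothetical check: True if `generated` keeps every fixed template
    line, in order, and only adds lines where the template had a marker."""
    gen = generated.splitlines()
    pos = 0          # index of the next unread line in the generated file
    in_todo = False  # True while model-written lines may be skipped
    for line in template.splitlines():
        if marker in line:
            in_todo = True
            continue
        if in_todo:
            # Skip model-written lines until the next fixed line reappears.
            while pos < len(gen) and gen[pos] != line:
                pos += 1
            in_todo = False
        if pos >= len(gen) or gen[pos] != line:
            return False  # a fixed line was changed, removed, or preceded by extra code
        pos += 1
    # Trailing lines are allowed only if the template ended on a marker.
    return in_todo or pos == len(gen)


template = "int main() {\n  // TODO\n  return 0;\n}"
filled = "int main() {\n  do_transfer();\n  return 0;\n}"
tampered = "#include <other_lib.h>\n" + filled

ok_filled = edits_stay_in_todo_regions(template, filled)
ok_tampered = edits_stay_in_todo_regions(template, tampered)
```

Here `ok_filled` is `True` and `ok_tampered` is `False`, because the extra include sits outside any TODO region. The matching is greedy, so a model-written line identical to the next fixed line would confuse this sketch; a production check would need a stricter region format.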

Today's frontier LLMs write excellent single-device code yet consistently fail on multi-device GPU communication, precisely the code that bottlenecks large-scale LLM training and inference. CommBench measures, and aims to help close, that gap.

How It Works

 empty_*.cu/cpp          LLM (GPT, Gemini, Claude, Grok, GLM, Qwen …)          generated_*.cu/cpp
 ┌──────────────┐              ┌──────────┐                   ┌──────────────────┐
 │  // TODO     │ ── prompt ──▶│  Model   │── code response──▶│  filled-in code  │
 │  // TODO     │              └──────────┘                   └──────────────────┘
 └──────────────┘                                                       │
                                                                        ▼
                                                              build & run & compare
                                                                        │
                                                                        ▼
                                                            summary.json + plots

Prompt: reads the empty_* template (the reference code with its core logic replaced by // TODO) and builds a completion prompt.
Generate: sends the prompt to an LLM. With --max-rounds > 1, compile and run errors are fed back to the model for self-correction.
Evaluate: compiles and runs both the reference and the generated code, compares them on correctness and performance (latency, throughput), and writes CSV files, plots, and a summary.json.

Supported models: OpenAI GPT, Google Gemini, Anthropic Claude, Grok, DeepSeek, GLM, and Qwen.
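
The Generate and Evaluate steps form a feedback loop when more than one round is allowed. The sketch below shows the general shape of such a loop; it is an illustration, not the repository's code. The names `refine`, `generate`, and `build_and_run` and the prompt wording are all hypothetical, and the model and harness are passed in as plain callables so the example runs without an API key or a GPU.

```python
from typing import Callable, Optional, Tuple


def refine(template: str,
           generate: Callable[[str], str],
           build_and_run: Callable[[str], Tuple[bool, str]],
           max_rounds: int = 1) -> Tuple[Optional[str], int]:
    """Hypothetical refinement loop: returns (passing code or None, rounds used)."""
    prompt = f"Complete every // TODO in this file:\n{template}"
    for round_no in range(1, max_rounds + 1):
        code = generate(prompt)
        ok, log = build_and_run(code)
        if ok:
            return code, round_no
        # Feed the compile/run log back so the next attempt can self-correct.
        prompt = (f"{prompt}\n\nPrevious attempt:\n{code}\n\n"
                  f"It failed with:\n{log}\nFix the code.")
    return None, max_rounds


# Stand-ins for a real model and a real build harness.
def fake_model(prompt: str) -> str:
    return "fixed" if "failed with" in prompt else "broken"


def fake_harness(code: str) -> Tuple[bool, str]:
    return (code == "fixed", "" if code == "fixed" else "error: undefined symbol")


code, rounds = refine("// TODO", fake_model, fake_harness, max_rounds=3)
single = refine("// TODO", fake_model, fake_harness, max_rounds=1)
```

With the stand-ins above, the three-round run succeeds on round 2, while the single-round run returns `(None, 1)` because no feedback is ever sent.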

Benchmark Categories

Category	What it covers
P2P	Point-to-point transfer between a pair of devices
Collective	Group operations across all ranks (AllReduce, AllGather, All-to-All, …)
EP	Dynamic, non-uniform dispatch/combine traffic for MoE models
Fusion	Kernels that interleave communication with compute (e.g., AllGather+GEMM)
Utilities	Connection setup, buffer registration, topology queries, GPU–CPU FIFO queues

By source, examples span cuda-runtime, ibverbs, Mscclpp, NCCL, NVSHMEM, DeepEP, the NCCL device API, ThunderKittens, vLLM, and SGLang.

Leaderboard

Sorted by Pass×GM ⭐ — pass rate scaled by geometric-mean code quality on passing examples. See the blog post for full metric definitions and case studies.

Rank	Model	Pass×GM	Pass Rate	PASS+Good	GM‑Speedup	Open Source	Price
🥇	gpt-5.5	0.467	57.4%	30.7%	0.813	❌	$1.91
🥈	gemini-3.1-pro-preview	0.305	36.6%	25.7%	0.832	❌	$0.26
🥉	claude-opus-4-7	0.282	33.7%	20.8%	0.836	❌	$0.21
4️⃣	glm-5.1	0.281	29.7%	17.8%	0.947	✅	$0.63
5️⃣	kimi-k2.6	0.275	30.7%	18.8%	0.895	✅	$0.10
6️⃣	qwen3.7-max	0.269	26.7%	15.8%	1.008	❌	$0.03
7️⃣	deepseek-v4-pro	0.197	19.8%	12.9%	0.995	✅	$0.02
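
As a quick arithmetic check on the ranking column, the snippet below recomputes Pass×GM as Pass Rate × GM‑Speedup from the values in the table and confirms that it reproduces both the reported scores (to within rounding of the published inputs) and the published order. The tuple layout and variable names are for illustration only.

```python
# (model, pass rate, GM-speedup, reported Pass×GM), copied from the leaderboard.
rows = [
    ("gpt-5.5",                0.574, 0.813, 0.467),
    ("gemini-3.1-pro-preview", 0.366, 0.832, 0.305),
    ("claude-opus-4-7",        0.337, 0.836, 0.282),
    ("glm-5.1",                0.297, 0.947, 0.281),
    ("kimi-k2.6",              0.307, 0.895, 0.275),
    ("qwen3.7-max",            0.267, 1.008, 0.269),
    ("deepseek-v4-pro",        0.198, 0.995, 0.197),
]

# Recompute the primary metric from its two factors.
recomputed = {model: pass_rate * gm for model, pass_rate, gm, _ in rows}

# Largest gap between recomputed and reported scores.
max_gap = max(abs(recomputed[model] - reported) for model, _, _, reported in rows)

# Rank by the recomputed score, best first.
ranking = sorted(recomputed, key=recomputed.get, reverse=True)

for model in ranking:
    print(f"{model:24s} {recomputed[model]:.3f}")
```

Note how qwen3.7-max and deepseek-v4-pro have the highest GM‑Speedup values yet rank last: their speedups are averaged over a smaller set of passing examples, which is why the product, not GM‑Speedup alone, is used for ranking.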

Metrics

Metric	Formula	What it measures
Pass×GM ⭐	Pass Rate × GM‑Speedup	Pass rate scaled by geometric-mean code quality on passing examples. Primary ranking metric.
Pass Rate	PASS / Total	Fraction of examples where code compiled, ran, and produced correct results.
PASS+Good	(on_compare + better) / Total	Fraction of all examples whose generated code is both correct and performant (no more than 5% slower than the reference, or faster).
GM‑Speedup	GM of per-example speedup scores	Geometric mean of generated-vs-reference performance ratios over passing examples, taken across measured data sizes so each data point contributes equally. Because it is computed only over passing examples, a model that passes more (often harder) examples with mediocre performance can score lower on this metric; we therefore rank by Pass×GM.
Price	—	Average cost per example (USD).
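
The definitions above can be sketched in a few lines of Python. This is one reading of the table, not the benchmark's scoring code: the record layout (`passed`, `speedups`), the function names, and the choice to pool every data point from passing examples into a single geometric mean are assumptions, as is treating an example as "Good" when its own geometric-mean speedup is at least 0.95. Speedup is taken here as reference time divided by generated time, so values above 1 mean the generated code is faster.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def summarize(results, good_threshold=0.95):
    """results: list of dicts with 'passed' (bool) and 'speedups'
    (one reference/generated ratio per measured data size)."""
    total = len(results)
    passing = [r for r in results if r["passed"]]
    pass_rate = len(passing) / total
    # Pool all data points from passing examples so each counts equally.
    points = [s for r in passing for s in r["speedups"]]
    gm_speedup = geometric_mean(points) if points else 0.0
    # Assumption: "Good" means the example's own GM speedup is >= 0.95.
    good = sum(1 for r in passing
               if geometric_mean(r["speedups"]) >= good_threshold)
    return {
        "pass_rate": pass_rate,
        "pass_good": good / total,
        "gm_speedup": gm_speedup,
        "pass_x_gm": pass_rate * gm_speedup,
    }


# Four made-up examples: two pass (one at parity, one 2x slower), two fail.
results = [
    {"passed": True, "speedups": [1.0, 1.0]},
    {"passed": True, "speedups": [0.5, 0.5]},
    {"passed": False, "speedups": []},
    {"passed": False, "speedups": []},
]
summary = summarize(results)
```

For this toy input the pass rate is 0.5, PASS+Good is 0.25, and GM‑Speedup is the square root of 0.5, so Pass×GM is about 0.354.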

Top vs. Bottom Model

A detailed comparison of the highest- and lowest-scoring models, gpt-5.5 (Pass×GM = 0.467) and deepseek-v4-pro (Pass×GM = 0.197).

Difficulty Breakdown

[Figure: difficulty breakdown, one panel each for GPT-5.5 and DeepSeek-V4-Pro]

Performance Quality Among PASS Examples

[Figure: performance quality among PASS examples, one panel each for GPT-5.5 and DeepSeek-V4-Pro]

Tag and Library Coverage

[Figure: tag and library coverage, one panel each for GPT-5.5 and DeepSeek-V4-Pro]

Key findings:

Even the strongest model, gpt-5.5, passes under 60% of examples (57.4%) and produces correct, performant code on fewer than a third (30.7%).
Every model collapses to near-zero coverage on specialized libraries such as Mscclpp, ThunderKittens, and the NCCL device API. Models hallucinate APIs, misplace synchronization, and ship kernels that are orders of magnitude slower than the reference.
Multi-round self-correction helps only on commodity libraries and easier tasks. Giving deepseek-v4-pro 5 rounds raises its pass rate from 15.8% to 41.6%, but unlocks neither Hard examples nor specialized libraries.

Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Set at least one API key
export OPENAI_API_KEY="..."          # → --model gpt-4o
export GOOGLE_API_KEY="..."          # → --model gemini-3-pro-preview
export ANTHROPIC_API_KEY="..."       # → --model claude-sonnet-4-5-20250929

# 3. Verify GPU compiler
nvcc --version    # NVIDIA
hipcc --version   # AMD

# 4. Run an evaluation
python scripts/generate_eval_one.py example001_gpu_comm_single_process \
    --model gpt-4o \
    --max-rounds 3

This reads the empty_* template, sends it to the model, compiles and runs the generated code, compares it against the reference, and writes a summary.json with correctness and performance results.

Options

python scripts/generate_eval_one.py <example_name> [options]

Options:
  --model MODEL          LLM to use (default: gpt-4o)
  --max-rounds N         Max generation rounds with error feedback (default: 1)
  --temperature FLOAT    Sampling temperature (default: 0.3)
  --datasets-dir PATH    Base datasets directory (default: ./datasets)
  --no-save              Don't save generated code
  --quiet                Suppress detailed output
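
To run several examples in a row, the documented flags can be wrapped in a small driver. The sketch below is a convenience wrapper, not part of the repository: `build_command` and `run_examples` are hypothetical names, and only the flags listed above are used. It assumes it is launched from the repository root so that `scripts/generate_eval_one.py` resolves.

```python
import subprocess
import sys


def build_command(example, model="gpt-4o", max_rounds=1,
                  temperature=0.3, quiet=False):
    """Assemble the argument list for one evaluation run."""
    cmd = [sys.executable, "scripts/generate_eval_one.py", example,
           "--model", model,
           "--max-rounds", str(max_rounds),
           "--temperature", str(temperature)]
    if quiet:
        cmd.append("--quiet")
    return cmd


def run_examples(examples, **options):
    """Run each example in turn and return {example: process exit code}."""
    return {ex: subprocess.run(build_command(ex, **options)).returncode
            for ex in examples}


# Inspect the command without launching anything.
cmd = build_command("example001_gpu_comm_single_process", max_rounds=3)
print(" ".join(cmd[1:]))
```

Calling `run_examples(["example001_gpu_comm_single_process"], model="gpt-4o", max_rounds=3)` would then launch the run from the Quick Start; what the exit code signifies depends on the script itself.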

A single example's build_and_run.py can also be run directly:

python build_and_run.py --source ref_gpu_p2p_comm.cpp                              # build & run reference
python build_and_run.py --compare ref_gpu_p2p_comm.cpp generated_gpu_p2p_comm.cpp  # compare ref vs generated

Contributing

To contribute a new example, please follow the requirements in the Dataset Instructions.

Acknowledgements

We thank Mibura and AMD for sponsoring the testbed for this benchmark.

License

MIT License

