Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20b (17%→19%) by gagika · Pull Request #4118 · AI-Hypercomputer/maxtext

gagika · 2026-06-09T16:43:19Z

Description

Performance optimization of the qwen3-30b-a3b and gpt-oss-20b distillation configs on
TPU v7x (tpu7x-4x4x4).

MFU on v7x:

- qwen3-30b-a3b: ~20% → ~26% at pdbs=6
- qwen3-30b-a3b, pdbs=8 + activation-offload variant: ~22% → ~24%
- gpt-oss-20b: ~17% → ~19%
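
As a quick check on what these figures imply, the sketch below converts the before/after MFU values into relative gains. It assumes the approximate percentages quoted above and a fixed hardware peak, so the MFU ratio equals the ratio of achieved model FLOP/s. The name `relative_gain` is illustrative and not part of MaxText.

```python
# Approximate MFU figures quoted in the PR description, in percent.
MFU_BEFORE_AFTER = {
    "qwen3-30b-a3b (default config)": (20.0, 26.0),
    "qwen3-30b-a3b (pdbs=8 + activation offload)": (22.0, 24.0),
    "gpt-oss-20b": (17.0, 19.0),
}


def relative_gain(before: float, after: float) -> float:
    """Relative increase in achieved model FLOP/s, as a fraction."""
    return after / before - 1.0


for name, (before, after) in MFU_BEFORE_AFTER.items():
    gain = relative_gain(before, after)
    print(f"{name}: {before:.0f}% -> {after:.0f}% MFU ({gain:+.1%} relative)")
```

Because the inputs are rounded to whole percentage points, the relative gains are only indicative.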

All knobs are profile-derived (xplane) and documented inline:

- context=device + custom remat: keeps attention outputs on device so the
  backward pass skips the splash-forward re-runs (the per-layer forward call
  count halves in the profile). This is the dominant win; the configs note the
  HBM frontiers (qwen fits up to pdbs=6 with the teacher resident; gpt-oss is
  limited to pdbs=1 at seq 32k).
- Megablox grouped-matmul tiles: qwen uses the full dims (emb 2048 /
  moe-mlp 768), ~+10% together with the layout change; gpt-oss raises the
  batch-seq m-tile 512 → 1024 (its k/n dims are already full), ~+3%.
- Splash-attention: sa_q_layout: SEQ_MINOR on qwen; kv-compute sub-blocks
  2048 → 1024 on gpt-oss (~+2%; uniformly smaller blocks regress).
- Batch / mesh: the qwen default moves pdbs 4 → 6 in the headroom freed by
  dropping host offload; the gpt-oss mesh moves to dp2 × fsdp64.
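
The mesh and batch settings in the last item can be sanity-checked in a few lines: the product of the parallelism axes has to match the number of devices, and the global batch is the per-device batch size times that device count. The sketch below is plain Python rather than MaxText code. `mesh_size`, `check_mesh`, and `global_batch` are hypothetical helpers, and the device count of 128 is an assumption (a 4x4x4 slice with two devices per chip) that the PR does not state.

```python
from math import prod


def mesh_size(axes: dict[str, int]) -> int:
    """Number of devices a mesh with the given axis sizes occupies."""
    return prod(axes.values())


def check_mesh(axes: dict[str, int], num_devices: int) -> None:
    """Raise if the mesh axes do not multiply out to the device count."""
    size = mesh_size(axes)
    if size != num_devices:
        raise ValueError(f"mesh covers {size} devices, expected {num_devices}")


def global_batch(per_device_batch_size: int, num_devices: int) -> int:
    """Sequences processed per step across all devices."""
    return per_device_batch_size * num_devices


# gpt-oss mesh from the PR description: dp2 x fsdp64.
gpt_oss_mesh = {"data": 2, "fsdp": 64}

# ASSUMPTION: 4 * 4 * 4 = 64 chips with 2 devices each (not stated in the PR).
ASSUMED_NUM_DEVICES = 4 * 4 * 4 * 2

check_mesh(gpt_oss_mesh, ASSUMED_NUM_DEVICES)
print("gpt-oss sequences per step (pdbs=1):", global_batch(1, ASSUMED_NUM_DEVICES))
print("qwen sequences per step (pdbs=6):", global_batch(6, ASSUMED_NUM_DEVICES))
```

If the slice exposes a different number of devices, only `ASSUMED_NUM_DEVICES` needs to change; the check itself is independent of that assumption.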

Tests

Each configuration was run for 30 steps on tpu7x-4x4x4 (via xpk). Deltas are
computed from steady-state step times against the corresponding baseline.
Loss and perplexity are unchanged from baseline.
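
One common way to get a steady-state step time from a short run is to drop the first few steps (compilation and warm-up) and take the median of the rest. The sketch below assumes that approach; the PR does not say how its steady-state numbers were computed. The warm-up count and the step times are made-up placeholders, not measurements from these runs.

```python
from statistics import median


def steady_state_step_time(step_times_s: list[float], warmup_steps: int = 5) -> float:
    """Median step time after discarding the first `warmup_steps` steps."""
    if warmup_steps >= len(step_times_s):
        raise ValueError("need at least one step after warm-up")
    return median(step_times_s[warmup_steps:])


def speedup(baseline_s: float, optimized_s: float) -> float:
    """Relative throughput gain when the work per step is unchanged."""
    return baseline_s / optimized_s - 1.0


# PLACEHOLDER step times in seconds for two 30-step runs (not real data).
baseline_run = [30.0, 12.0] + [10.0] * 28
optimized_run = [28.0, 10.0] + [8.0] * 28

base = steady_state_step_time(baseline_run)
opt = steady_state_step_time(optimized_run)
print(f"baseline {base:.1f}s, optimized {opt:.1f}s, gain {speedup(base, opt):+.1%}")
```

The step-time ratio only equals the throughput gain when both runs process the same number of tokens per step; when the batch size changes, as with pdbs 4 → 6, compare tokens per second or MFU instead.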

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-06-09T16:47:30Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


github-actions · 2026-06-09T16:53:28Z

🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions · 2026-06-09T16:56:34Z

🤖 I'm sorry @gagika, but I was unable to process your request. Please see the logs for more details.

vlad-karp · 2026-06-09T18:36:04Z

@@ -1,4 +1,4 @@
-# Qwen3-30b-a3b-base distillation, pdbs=8 + activation offload. ~22% MFU on v7x.
+# Qwen3-30b-a3b-base distillation, pdbs=8 + activation offload. ~24% MFU on v7x.


What is the difference vs. the previous config file?

gagika · Jun 11, 2026

Per-device batch size is 8 here (the yml name ends with pdbs8).
The previous config gets the best MFU with pdbs=6, so we can default to that unless a larger batch size is needed for convergence.


gagika changed the title ~~Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20b (17%→18%)~~ Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20b (17%→19%) distillation Jun 9, 2026

gagika added the gemini-review label Jun 9, 2026

gagika changed the title ~~Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20b (17%→19%) distillation~~ Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20b (17%→19%) Jun 9, 2026

gagika marked this pull request as ready for review June 9, 2026 17:50

gagika force-pushed the agagik-distill-perf branch from c03fbae to 44f5e6f Compare June 9, 2026 17:54

vlad-karp approved these changes Jun 9, 2026


Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20…

79ef432

…b (17%→18%).

gagika force-pushed the agagik-distill-perf branch from 44f5e6f to 79ef432 Compare June 11, 2026 15:59

entrpn approved these changes Jun 11, 2026


gagika added the pull ready label Jun 11, 2026

copybara-service Bot merged commit ef42536 into main Jun 11, 2026
54 of 57 checks passed

copybara-service Bot deleted the agagik-distill-perf branch June 11, 2026 16:42
