Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20b (17%→19%)#4118

Merged
copybara-service[bot] merged 1 commit into main from agagik-distill-perf
Jun 11, 2026

@gagika gagika commented Jun 9, 2026

Collaborator

Description

Performance optimization of the qwen3-30b-a3b and gpt-oss-20b distillation configs on
TPU v7x (tpu7x-4x4x4).

MFU on v7x:

  • qwen3-30b-a3b: ~20% → ~26% at pdbs=6; the pdbs=8 + activation-offload
    variant goes from ~22% → ~24%
  • gpt-oss-20b: ~17% → ~19%
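
The MFU figures above can be turned into relative throughput. A minimal sketch, assuming the model FLOPs per token and the hardware peak are the same before and after, so that tokens/s scales linearly with MFU. The function names are hypothetical and the inputs are the approximate values quoted in this description:

```python
def throughput_gain(mfu_before: float, mfu_after: float) -> float:
    """Relative tokens/s implied by an MFU change.

    Assumes fixed model FLOPs per token and fixed peak FLOPs/s,
    so throughput is proportional to MFU.
    """
    return mfu_after / mfu_before


def time_per_token_reduction(mfu_before: float, mfu_after: float) -> float:
    """Fraction of per-token wall time saved under the same assumption."""
    return 1.0 - mfu_before / mfu_after


# Approximate MFU values quoted in the PR description (before, after).
REPORTED = {
    "qwen3-30b-a3b, pdbs=6": (0.20, 0.26),
    "qwen3-30b-a3b, pdbs=8 + activation offload": (0.22, 0.24),
    "gpt-oss-20b": (0.17, 0.19),
}

for name, (before, after) in REPORTED.items():
    gain = throughput_gain(before, after)
    saved = time_per_token_reduction(before, after)
    print(f"{name}: {gain:.2f}x tokens/s, {saved:.1%} less time per token")
```

Because MFU is normalized by the work done, this compares time per token, not time per step. A config that also changes pdbs changes the tokens per step, so raw step times are not directly comparable across those runs.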

All knobs are profile-derived (xplane) and documented inline:

  • context=device + custom remat — keep attention outputs on device so the
    backward pass skips the splash-forward re-runs (per-layer forward call count
    halves in the profile). The dominant win; the configs note the HBM frontiers
    (qwen fits to pdbs=6 with the teacher resident, gpt-oss is pdbs=1 at seq 32k).
  • Megablox grouped-matmul tiles — qwen at the full dims (emb 2048 /
    moe-mlp 768), ~+10% together with the layout change; gpt-oss raises the
    batch-seq m-tile 512 → 1024 (its k/n dims are already full), ~+3%.
  • Splash-attention: sa_q_layout=SEQ_MINOR on qwen; kv-compute sub-blocks
    2048 → 1024 on gpt-oss (~+2%; uniformly smaller blocks regress).
  • Batch / mesh — qwen default moves pdbs 4 → 6 in the headroom freed by
    dropping host offload; gpt-oss mesh moves to dp2 × fsdp64.
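
The batch and mesh bullet reduces to simple arithmetic. A minimal sketch, assuming the device count is the product of the mesh axes, the global batch is the per-device batch size times the device count, and "seq 32k" means 32768 tokens. The helper names are hypothetical and not taken from the configs, and the qwen lines reuse the gpt-oss device count purely for illustration:

```python
from math import prod


def device_count(mesh_axes: dict[str, int]) -> int:
    """Number of devices covered by a mesh: the product of its axis sizes."""
    return prod(mesh_axes.values())


def global_batch(per_device_batch: int, devices: int) -> int:
    """Sequences per step, assuming every device holds per_device_batch sequences."""
    return per_device_batch * devices


# gpt-oss mesh from the description: dp2 x fsdp64.
gpt_oss_mesh = {"dp": 2, "fsdp": 64}
devices = device_count(gpt_oss_mesh)

SEQ_LEN = 32 * 1024  # assumption: "seq 32k" means 32768 tokens

gpt_oss_sequences = global_batch(1, devices)  # pdbs=1
gpt_oss_tokens = gpt_oss_sequences * SEQ_LEN

# qwen default moves pdbs 4 -> 6; the ratio holds for any device count.
qwen_before = global_batch(4, devices)
qwen_after = global_batch(6, devices)

print(f"gpt-oss: {devices} devices, {gpt_oss_sequences} sequences/step, "
      f"{gpt_oss_tokens} tokens/step")
print(f"qwen: global batch grows {qwen_after / qwen_before:.2f}x")
```

Under these assumptions the move from pdbs 4 to 6 grows the global batch by half, which is why the review thread below raises convergence when choosing between pdbs values.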

Tests

30-step runs on tpu7x-4x4x4 (xpk), steady-state step times, each
delta measured against the baselines. Loss and perplexity unchanged from
baseline.
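
One way to reduce a 30-step run to a steady-state step time and a delta against a baseline is sketched below. The description does not say how steady state was determined, so the warmup cutoff and the use of the median are assumptions, the function names are hypothetical, and the step times are synthetic placeholders rather than measurements from this PR:

```python
from statistics import median


def steady_state_step_time(step_times_s: list[float], warmup_steps: int = 5) -> float:
    """Median step time after discarding the first warmup_steps steps.

    The early steps typically include compilation and cache warmup,
    so they are excluded (the cutoff is an assumption).
    """
    steady = step_times_s[warmup_steps:]
    if not steady:
        raise ValueError("run is shorter than the warmup window")
    return median(steady)


def relative_delta(baseline_s: float, candidate_s: float) -> float:
    """Signed fractional change in step time; negative means faster."""
    return (candidate_s - baseline_s) / baseline_s


# Synthetic 30-step runs: two slow warmup steps, then a flat steady state.
baseline_run = [30.0, 30.0] + [10.0] * 28
candidate_run = [30.0, 30.0] + [8.0] * 28

baseline = steady_state_step_time(baseline_run)
candidate = steady_state_step_time(candidate_run)
print(f"step time delta vs baseline: {relative_delta(baseline, candidate):+.1%}")
```

The median is used here because it is less sensitive than the mean to an occasional slow step, for example one that coincides with a checkpoint write.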

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@gagika gagika changed the title from "Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20b (17%→18%)" to "Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20b (17%→19%) distillation" Jun 9, 2026
@codecov

codecov Bot commented Jun 9, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.


@github-actions

github-actions Bot commented Jun 9, 2026


🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions

github-actions Bot commented Jun 9, 2026


🤖 I'm sorry @gagika, but I was unable to process your request. Please see the logs for more details.

@gagika gagika changed the title from "Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20b (17%→19%) distillation" to "Optimize distillation perf for qwen3-30b (20%→26% MFU) and gpt-oss-20b (17%→19%)" Jun 9, 2026
@gagika gagika marked this pull request as ready for review June 9, 2026 17:50
@gagika gagika force-pushed the agagik-distill-perf branch from c03fbae to 44f5e6f Compare June 9, 2026 17:54
@@ -1,4 +1,4 @@
-# Qwen3-30b-a3b-base distillation, pdbs=8 + activation offload. ~22% MFU on v7x.
+# Qwen3-30b-a3b-base distillation, pdbs=8 + activation offload. ~24% MFU on v7x.

Collaborator

What is the difference vs. the previous config file?

Collaborator Author

The per-device batch size is 8 here (the yml name ends with pdbs8).
The previous config gets its best MFU at pdbs=6, so we can default to that unless we need a larger batch size for convergence.

@gagika gagika force-pushed the agagik-distill-perf branch from 44f5e6f to 79ef432 Compare June 11, 2026 15:59
@copybara-service copybara-service Bot merged commit ef42536 into main Jun 11, 2026
54 of 57 checks passed
@copybara-service copybara-service Bot deleted the agagik-distill-perf branch June 11, 2026 16:42

3 participants