The original OpenAI Parameter Golf challenge lives at openai/parameter-golf. This repository is my experiment fork: it collects the model variants, tokenizer/data experiments, logs, reports, helper scripts, and record snapshots we tried while iterating.
The core training entrypoints are:
- train_gpt.py: PyTorch/CUDA training.
- train_gpt_mlx.py: Apple MLX training for local iteration (not all variations exist here yet).
Derived reports live under reports/. Raw run logs stay in logs/ for MLX and gpu_logs/ for CUDA/PyTorch.
Create an environment and install the packages used by the local MLX path plus dataset tooling:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install mlx numpy sentencepiece huggingface-hub datasets tqdm
Download cached FineWeb shards and tokenizer assets:
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10
Run a small MLX smoke job:
RUN_ID=mlx_smoke \
ITERATIONS=200 \
TRAIN_BATCH_TOKENS=8192 \
VAL_LOSS_EVERY=0 \
VAL_BATCH_SIZE=8192 \
python3 train_gpt_mlx.py
Run a CUDA/PyTorch job directly:
RUN_ID=baseline_sp1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
Or use the current-run launcher:
./run_gpu_job.sh
See docs/gpu_experiment_runbook.md for archived GPU command examples and common environment toggles.
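The launch commands above pass their settings as environment variables (RUN_ID, ITERATIONS, VOCAB_SIZE, and so on). As a rough illustration of that pattern, here is a minimal sketch of how a script could read such variables with fallbacks. The RunConfig class, the load_config helper, and every default value are assumptions made up for this example; the trainers' actual parsing and defaults may differ.

```python
import os
from dataclasses import dataclass


@dataclass
class RunConfig:
    """Hypothetical container; fields mirror env vars used in the commands above."""
    run_id: str
    iterations: int
    train_batch_tokens: int
    val_loss_every: int
    vocab_size: int


def load_config(env=None) -> RunConfig:
    """Build a RunConfig from a mapping of environment variables.

    The fallback values are placeholders for illustration only,
    not the real defaults of train_gpt.py or train_gpt_mlx.py.
    """
    env = os.environ if env is None else env
    return RunConfig(
        run_id=env.get("RUN_ID", "dev"),
        iterations=int(env.get("ITERATIONS", "200")),
        train_batch_tokens=int(env.get("TRAIN_BATCH_TOKENS", "8192")),
        val_loss_every=int(env.get("VAL_LOSS_EVERY", "0")),
        vocab_size=int(env.get("VOCAB_SIZE", "1024")),
    )


# Example: the same values as the MLX smoke job, passed as a plain dict.
smoke = load_config({"RUN_ID": "mlx_smoke", "ITERATIONS": "200"})
print(smoke)
```

Passing a dict instead of reading os.environ directly keeps the helper easy to test without touching the real process environment.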
The repo contains both live knobs in the current trainers and archived variants under records/, logs/, and gpu_logs/. In broad strokes, we tried:
- Tokenizer variants: baseline 1024-token SentencePiece BPE plus filter-aware 2048, 4096, and 8192 BPE tokenizers in data/tokenizers/.
- Dataset filtering: exact deduplication, MinHash near-deduplication, URL/repetition/quality filters, removed-doc manifests, filtered train shards, and unchanged validation shard copies.
- Dataset ordering: MTLD-sorted train documents and token-length-sorted train documents.
- Longer context and sequence-length sweeps: 2048-token and 4096-token training/evaluation experiments.
- Learning-rate schedule changes: lower/higher LR sweeps, tuned tied-embedding/matrix/scalar learning rates, warmdown percent sweeps, and a wallclock-aware warmdown.
- Initialization changes: zero-init toggles, norm-init experiments, q/k gain sweeps, overtone/spectral-style embedding init in archived runs, and orthogonalization-related trials.
- Architecture changes: wider/deeper models, 10/11-layer variants, extra layers, depth recurrence, U-Net/asymmetric U-Net shapes, residual/skip tweaks, no-output-projection runs, and partial/exclusive attention variants.
- Attention experiments: exclusive self-attention, post-attention normalization, post-exclusive-attention, partial XSA, XSA on later layers, NTK/partial RoPE, YaRN-style context scaling in archived CUDA variants, and FlashAttention-oriented variants.
- MLP changes: SwiGLU, larger MLP multipliers, leaky ReLU squared, SmearGate, and MLP quantization mixes.
- Multi-token prediction: 2-token and 3-token MTP heads, MTP-only runs, MTP plus MoE, and excluded/auxiliary MTP-head export experiments.
- Decoder MoE: Mixture of Experts over the decoder layers.
- Test-time training: This is implemented in the MLX
- Bigram features: BigramHash embeddings, bigram vocab/dimension sweeps, and combinations with SmearGate and quantized MLPs.
- Weight averaging: SWA-style runs, EMA variants, and late weight-average start sweeps.
- Optimizer variants: Muon/Adam split tuning, Parallel Muon in archived records, NeoMuon in archived ternary/binary variants, Muon weight decay changes, and gradient accumulation/batch-size sweeps.
- Quantization and compression: int8 roundtrip export, GPTQ-lite/GPTQ embedding calibration, QAT, int6/int5/mixed precision, ternary weights, binary weights, FP8-oriented archived variants, zlib/zstd-style compressed artifacts, and calibration-token sweeps.
- Evaluation variants: sliding-window evaluation, stride changes, tokenizer-aware val_bpb, int8/int6 roundtrip validation, prediction-sample dumps, and final validation with optional TTT.
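Because the tokenizer experiments above change the vocabulary size, per-token loss is not directly comparable between runs; a bits-per-byte metric normalizes by the length of the underlying text instead. The following is a minimal sketch of that conversion, assuming per-token losses in nats and a known UTF-8 byte count for each target token. The function name and its inputs are hypothetical, and the repository's val_bpb computation may differ in detail.

```python
import math


def bits_per_byte(token_nll_nats, token_byte_lengths):
    """Convert per-token negative log-likelihoods (nats) to bits per byte.

    token_nll_nats: loss for each predicted token, in nats.
    token_byte_lengths: number of text bytes each of those tokens decodes to.
    """
    if len(token_nll_nats) != len(token_byte_lengths):
        raise ValueError("need one byte length per token loss")
    total_bytes = sum(token_byte_lengths)
    if total_bytes <= 0:
        raise ValueError("total byte count must be positive")
    # nats -> bits: divide by ln(2)
    total_bits = sum(token_nll_nats) / math.log(2)
    return total_bits / total_bytes


# Two tokens, each costing exactly 1 bit (ln 2 nats), covering 3 + 1 bytes:
# 2 bits over 4 bytes.
example = bits_per_byte([math.log(2), math.log(2)], [3, 1])
print(example)
```

A tokenizer with longer tokens pays more loss per token but covers more bytes per token, so dividing by bytes rather than tokens puts 1024-, 2048-, 4096-, and 8192-entry vocabularies on the same scale.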
Generated artifacts are organized by type:
- reports/run_summaries/: GPU and MLX TSV summaries.
- reports/loss_plots/: generated training-loss plots.
- reports/dataset_analysis/: dataset analysis JSON, CSV, and sample-doc artifacts.
Regenerate summaries:
python generate_run_summary.py --backend mlx
python generate_run_summary.py --backend gpu
Regenerate plots:
python plot_training_loss.py --backend mlx
python plot_training_loss.py --backend gpu
The legacy wrapper names still work:
python generate_mlx_run_summary.py
python generate_gpu_run_summary.py
python plot_mlx_training_loss.py
python plot_gpu_training_loss.py
- records/: snapshots of notable runs and submitted variants kept for reference.
- logs/: MLX raw logs and loss CSVs.
- gpu_logs/: CUDA/PyTorch raw logs and loss CSVs.
This repository adapts code from modded-nanogpt; see THIRD_PARTY_NOTICES.md for attribution.