Parameter Golf Experiments

The original OpenAI Parameter Golf challenge lives at openai/parameter-golf. This repository is our experiment fork: it collects the model variants, tokenizer/data experiments, logs, reports, helper scripts, and record snapshots we tried while iterating.

The core training entrypoints are:

train_gpt.py: PyTorch/CUDA training.
train_gpt_mlx.py: Apple MLX training for local iteration (not all variants exist here yet).

Derived reports live under reports/. Raw run logs stay in logs/ for MLX and gpu_logs/ for CUDA/PyTorch.

Quick Start

Create an environment and install the packages used by the local MLX path plus dataset tooling:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install mlx numpy sentencepiece huggingface-hub datasets tqdm

Download cached FineWeb shards and tokenizer assets:

python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10

Run a small MLX smoke job:

RUN_ID=mlx_smoke \
ITERATIONS=200 \
TRAIN_BATCH_TOKENS=8192 \
VAL_LOSS_EVERY=0 \
VAL_BATCH_SIZE=8192 \
python3 train_gpt_mlx.py
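
The smoke job above is configured entirely through environment variables. As a rough illustration of that pattern, and not a copy of what train_gpt_mlx.py actually does, the sketch below reads the same five variables into a typed config. The SmokeConfig class, the load_config helper, and every default value are hypothetical names and placeholders chosen for this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SmokeConfig:
    """Hypothetical container; field names mirror the variables in the command above."""
    run_id: str
    iterations: int
    train_batch_tokens: int
    val_loss_every: int
    val_batch_size: int


def load_config(env: Optional[Mapping[str, str]] = None) -> SmokeConfig:
    """Build a config from an environment mapping (os.environ when none is given).

    The fallback values are placeholders for this sketch, not the trainer's
    real defaults.
    """
    env = os.environ if env is None else env
    return SmokeConfig(
        run_id=env.get("RUN_ID", "unnamed_run"),
        iterations=int(env.get("ITERATIONS", "200")),
        train_batch_tokens=int(env.get("TRAIN_BATCH_TOKENS", "8192")),
        val_loss_every=int(env.get("VAL_LOSS_EVERY", "0")),
        val_batch_size=int(env.get("VAL_BATCH_SIZE", "8192")),
    )


if __name__ == "__main__":
    # Prints whatever the current shell environment resolves to.
    print(load_config())
```

Passing an explicit mapping instead of relying on os.environ keeps a helper like this easy to exercise in tests without touching the real shell environment.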

Run a CUDA/PyTorch job directly:

RUN_ID=baseline_sp1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=1 train_gpt.py

Or use the current-run launcher:

./run_gpu_job.sh

See docs/gpu_experiment_runbook.md for archived GPU command examples and common environment toggles.

Things We Tried

The repo contains both live knobs in the current trainers and archived variants under records/, logs/, and gpu_logs/. In broad strokes, we tried:

Tokenizer variants: baseline 1024-token SentencePiece BPE plus filter-aware 2048, 4096, and 8192 BPE tokenizers in data/tokenizers/.
Dataset filtering: exact deduplication, MinHash near-deduplication, URL/repetition/quality filters, removed-doc manifests, filtered train shards, and unchanged validation shard copies.
Dataset ordering: MTLD-sorted train documents and token-length-sorted train documents.
Longer context and sequence-length sweeps: 2048-token and 4096-token training/evaluation experiments.
Learning-rate schedule changes: lower/higher LR sweeps, tuned tied-embedding/matrix/scalar learning rates, warmdown percent sweeps, and a wallclock-aware warmdown.
Initialization changes: zero-init toggles, norm-init experiments, q/k gain sweeps, overtone/spectral-style embedding init in archived runs, and orthogonalization-related trials.
Architecture changes: wider/deeper models, 10/11-layer variants, extra layers, depth recurrence, U-Net/asymmetric U-Net shapes, residual/skip tweaks, no-output-projection runs, and partial/exclusive attention variants.
Attention experiments: exclusive self-attention, post-attention normalization, post-exclusive-attention, partial XSA, XSA on later layers, NTK/partial RoPE, YaRN-style context scaling in archived CUDA variants, and FlashAttention-oriented variants.
MLP changes: SwiGLU, larger MLP multipliers, leaky ReLU squared, SmearGate, and MLP quantization mixes.
Multi-token prediction: 2-token and 3-token MTP heads, MTP-only runs, MTP plus MoE, and excluded/auxiliary MTP-head export experiments.
Decoder MoE: Mixture of Experts over the decoder layers.
Test-time training: This is implemented in the MLX
Bigram features: BigramHash embeddings, bigram vocab/dimension sweeps, and combinations with SmearGate and quantized MLPs.
Weight averaging: SWA-style runs, EMA variants, and late weight-average start sweeps.
Optimizer variants: Muon/Adam split tuning, Parallel Muon in archived records, NeoMuon in archived ternary/binary variants, Muon weight decay changes, and gradient accumulation/batch-size sweeps.
Quantization and compression: int8 roundtrip export, GPTQ-lite/GPTQ embedding calibration, QAT, int6/int5/mixed precision, ternary weights, binary weights, FP8-oriented archived variants, zlib/zstd-style compressed artifacts, and calibration-token sweeps.
Evaluation variants: sliding-window evaluation, stride changes, tokenizer-aware val_bpb, int8/int6 roundtrip validation, prediction-sample dumps, and final validation with optional TTT.
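
Several of the items above change the tokenizer, so per-token loss is not comparable across runs; that is the motivation for a tokenizer-aware val_bpb. The sketch below shows the standard bits-per-byte conversion, assuming the validation loss is a cross-entropy in nats and that the UTF-8 byte count of the evaluated text is known. The function names are hypothetical, and this is not taken from the trainers in this repo.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Dividing by ln(2) turns nats into bits; dividing by the byte count of the
    evaluated text removes the dependence on how many tokens the tokenizer
    happened to produce.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


def bits_per_byte_from_mean(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Same metric, starting from a mean per-token loss and a token count."""
    return bits_per_byte(mean_loss_nats * n_tokens, n_bytes)


if __name__ == "__main__":
    # Made-up numbers for illustration only: a mean loss of 2.0 nats per token
    # on text where each token covers 2.5 bytes on average.
    print(bits_per_byte_from_mean(2.0, n_tokens=1000, n_bytes=2500))
```

With this normalization, a larger-vocabulary tokenizer that emits fewer tokens for the same text is not rewarded or penalized merely for its token count, because the total loss is spread over the same number of bytes.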

Reports

Generated artifacts are organized by type:

reports/run_summaries/: GPU and MLX TSV summaries.
reports/loss_plots/: generated training-loss plots.
reports/dataset_analysis/: dataset analysis JSON, CSV, and sample-doc artifacts.

Regenerate summaries:

python generate_run_summary.py --backend mlx
python generate_run_summary.py --backend gpu

Regenerate plots:

python plot_training_loss.py --backend mlx
python plot_training_loss.py --backend gpu

The legacy wrapper names still work:

python generate_mlx_run_summary.py
python generate_gpu_run_summary.py
python plot_mlx_training_loss.py
python plot_gpu_training_loss.py

Records And Logs

records/: snapshots of notable runs and submitted variants kept for reference.
logs/: MLX raw logs and loss CSVs.
gpu_logs/: CUDA/PyTorch raw logs and loss CSVs.

This repository adapts code from modded-nanogpt; see THIRD_PARTY_NOTICES.md for attribution.
