
themendu/parameter-golf


Parameter Golf Experiments

The original OpenAI Parameter Golf challenge lives at openai/parameter-golf. This repository is my experiment fork: it collects the model variants, tokenizer/data experiments, logs, reports, helper scripts, and record snapshots we tried while iterating.

The core training entrypoints are:

  • train_gpt.py: PyTorch/CUDA training.
  • train_gpt_mlx.py: Apple MLX training for local iteration (not all variants are available here yet).

Derived reports live under reports/. Raw run logs stay in logs/ for MLX and gpu_logs/ for CUDA/PyTorch.

Quick Start

Create an environment and install the packages used by the local MLX path plus dataset tooling:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install mlx numpy sentencepiece huggingface-hub datasets tqdm

Download cached FineWeb shards and tokenizer assets:

python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10

Run a small MLX smoke job:

RUN_ID=mlx_smoke \
ITERATIONS=200 \
TRAIN_BATCH_TOKENS=8192 \
VAL_LOSS_EVERY=0 \
VAL_BATCH_SIZE=8192 \
python3 train_gpt_mlx.py
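
The smoke job above is configured entirely through environment variables, so it can also be launched from Python, which may be convenient for small sweeps. The following is a minimal sketch, assuming the script reads the same variables shown in the shell example. `smoke_env`, `smoke_command`, and `launch_smoke` are hypothetical helpers and are not part of this repository.

```python
import os
import subprocess
import sys


def smoke_env(run_id="mlx_smoke", iterations=200, batch_tokens=8192):
    """Return a copy of the current environment with the smoke-job settings applied.

    The variable names mirror the shell example; this helper itself is hypothetical.
    """
    env = dict(os.environ)
    env.update(
        RUN_ID=run_id,
        ITERATIONS=str(iterations),  # environment values must be strings
        TRAIN_BATCH_TOKENS=str(batch_tokens),
        VAL_LOSS_EVERY="0",
        VAL_BATCH_SIZE=str(batch_tokens),
    )
    return env


def smoke_command(script="train_gpt_mlx.py"):
    """Build the command line, using the interpreter that runs this file."""
    return [sys.executable, script]


def launch_smoke(**overrides):
    """Start the smoke job and raise CalledProcessError on a non-zero exit status."""
    return subprocess.run(smoke_command(), env=smoke_env(**overrides), check=True)
```

Calling `launch_smoke(iterations=50)` from the repository root, inside the activated environment, would start a shorter run with the other settings unchanged.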

Run a CUDA/PyTorch job directly:

RUN_ID=baseline_sp1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=1 train_gpt.py

Or use the current-run launcher:

./run_gpu_job.sh

See docs/gpu_experiment_runbook.md for archived GPU command examples and common environment toggles.

Things We Tried

The repo contains both live knobs in the current trainers and archived variants under records/, logs/, and gpu_logs/. In broad strokes, we tried:

  • Tokenizer variants: baseline 1024-token SentencePiece BPE plus filter-aware 2048, 4096, and 8192 BPE tokenizers in data/tokenizers/.
  • Dataset filtering: exact deduplication, MinHash near-deduplication, URL/repetition/quality filters, removed-doc manifests, filtered train shards, and unchanged validation shard copies.
  • Dataset ordering: MTLD-sorted train documents and token-length-sorted train documents.
  • Longer context and sequence-length sweeps: 2048-token and 4096-token training/evaluation experiments.
  • Learning-rate schedule changes: lower/higher LR sweeps, tuned tied-embedding/matrix/scalar learning rates, warmdown percent sweeps, and a wallclock-aware warmdown.
  • Initialization changes: zero-init toggles, norm-init experiments, q/k gain sweeps, overtone/spectral-style embedding init in archived runs, and orthogonalization-related trials.
  • Architecture changes: wider/deeper models, 10/11-layer variants, extra layers, depth recurrence, U-Net/asymmetric U-Net shapes, residual/skip tweaks, no-output-projection runs, and partial/exclusive attention variants.
  • Attention experiments: exclusive self-attention, post-attention normalization, post-exclusive-attention, partial XSA, XSA on later layers, NTK/partial RoPE, YaRN-style context scaling in archived CUDA variants, and FlashAttention-oriented variants.
  • MLP changes: SwiGLU, larger MLP multipliers, leaky ReLU squared, SmearGate, and MLP quantization mixes.
  • Multi-token prediction: 2-token and 3-token MTP heads, MTP-only runs, MTP plus MoE, and excluded/auxiliary MTP-head export experiments.
  • Decoder MoE: Mixture of Experts over the decoder layers.
  • Test-time training (TTT): implemented in the MLX path.
  • Bigram features: BigramHash embeddings, bigram vocab/dimension sweeps, and combinations with SmearGate and quantized MLPs.
  • Weight averaging: SWA-style runs, EMA variants, and late weight-average start sweeps.
  • Optimizer variants: Muon/Adam split tuning, Parallel Muon in archived records, NeoMuon in archived ternary/binary variants, Muon weight decay changes, and gradient accumulation/batch-size sweeps.
  • Quantization and compression: int8 roundtrip export, GPTQ-lite/GPTQ embedding calibration, QAT, int6/int5/mixed precision, ternary weights, binary weights, FP8-oriented archived variants, zlib/zstd-style compressed artifacts, and calibration-token sweeps.
  • Evaluation variants: sliding-window evaluation, stride changes, tokenizer-aware val_bpb, int8/int6 roundtrip validation, prediction-sample dumps, and final validation with optional TTT.
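
Several items above refer to tokenizer-aware val_bpb. Bits per byte normalizes the summed token loss by the number of bytes the tokens decode to, so runs with different tokenizer vocabularies can be compared on one scale. The following is a minimal sketch of that arithmetic, assuming per-token losses in nats and a known byte count per token. `bits_per_byte` and the toy numbers are illustrative and are not taken from the trainers.

```python
import math


def bits_per_byte(token_losses_nats, token_byte_lengths):
    """Convert per-token cross-entropy (nats) into bits per byte.

    token_losses_nats: loss for each predicted token, in nats.
    token_byte_lengths: number of bytes each of those tokens decodes to.
    """
    if len(token_losses_nats) != len(token_byte_lengths):
        raise ValueError("need one byte length per token loss")
    total_bytes = sum(token_byte_lengths)
    if total_bytes <= 0:
        raise ValueError("total byte count must be positive")
    total_bits = sum(token_losses_nats) / math.log(2)  # nats -> bits
    return total_bits / total_bytes


# Toy example: four tokens, each with a loss of ln(2) nats (one bit),
# covering eight bytes in total.
losses = [math.log(2)] * 4
byte_lengths = [1, 2, 2, 3]
print(bits_per_byte(losses, byte_lengths))
```

A tokenizer with a larger vocabulary usually yields fewer tokens per document with a higher loss per token; dividing by bytes rather than tokens keeps the two effects from being confused.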

Reports

Generated artifacts are organized by type:

  • reports/run_summaries/: GPU and MLX TSV summaries.
  • reports/loss_plots/: generated training-loss plots.
  • reports/dataset_analysis/: dataset analysis JSON, CSV, and sample-doc artifacts.

Regenerate summaries:

python generate_run_summary.py --backend mlx
python generate_run_summary.py --backend gpu

Regenerate plots:

python plot_training_loss.py --backend mlx
python plot_training_loss.py --backend gpu

The legacy wrapper names still work:

python generate_mlx_run_summary.py
python generate_gpu_run_summary.py
python plot_mlx_training_loss.py
python plot_gpu_training_loss.py

Records And Logs

  • records/: snapshots of notable runs and submitted variants kept for reference.
  • logs/: MLX raw logs and loss CSVs.
  • gpu_logs/: CUDA/PyTorch raw logs and loss CSVs.

This repository adapts code from modded-nanogpt; see THIRD_PARTY_NOTICES.md for attribution.

About

A fork of the parameter golf challenge, including my experiment results
