fineweb
Here are 11 public repositories matching this topic...
Train a SmolLM-style LLM on FineWeb-Edu in JAX/Flax with an assortment of optimizers.
Updated Jul 24, 2025 - Python
Ingest the public web (Common Crawl + HF FineWeb + RSS + GDELT) onto your laptop. Queryable from DuckDB. Single Python process — no Spark, no cloud.
Updated May 31, 2026 - Python
A 66M parameter decoder-only transformer language model implemented from scratch in PyTorch. Features a custom SentencePiece tokenizer, RoPE positional embeddings, SwiGLU feed-forward network, per-layer KV cache for efficient autoregressive inference, and a Svelte-based streaming chat interface.
Updated May 13, 2026 - Python
Training GPT-2 on FineWeb-Edu in JAX/Flax
Updated May 29, 2026 - Python
OpenAI Parameter Golf competition entry — adaptive Hessian-sensitivity GPTQ clipping on PR #1855's stack. 1.06310 BPB (3-seed mean), PR #1962.
Updated May 6, 2026 - Python
Decoder-only LLM from scratch with reproducible data pipelines, tokenizer/sharding workflows, and GPU training.
Updated Apr 13, 2026 - Python
FineWeb-Edu dataset analysis using Apache Spark - DSC 232R group project
Updated Mar 24, 2026 - Jupyter Notebook
A minimal GPT model pretraining pipeline (mostly taken from Andrej Karpathy's build-nanogpt repo)
Updated May 18, 2026 - Python