An educational CLIP-style project built with PyTorch and trained on FashionMNIST. The repository pairs a small vision transformer with a transformer text encoder, projects both into a shared embedding space, and learns with a symmetric contrastive loss.
This is a learning-oriented implementation, not an exact reproduction of OpenAI CLIP. The codebase is intentionally small, the dataset is tiny compared with real CLIP training corpora, and most of the project is designed to support the companion notebook and tutorial article.
For a source-level walkthrough, see docs/technical-overview.md.
The project trains on image and caption pairs generated from FashionMNIST labels. Each class is converted into a caption such as "An image of a dress" or "An image of a sneaker", then the model learns to match the correct image and text pair within each batch.
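
The label-to-caption step can be sketched as follows. This is a minimal illustration, not the repository's code: `CLASS_NAMES` uses the standard FashionMNIST class order, while `label_to_caption` and the article-selection rule are assumptions made for this example, so the exact wording of the repo's captions may differ.

```python
# Standard FashionMNIST class names, indexed by integer label.
CLASS_NAMES = [
    "t-shirt/top", "trouser", "pullover", "dress", "coat",
    "sandal", "shirt", "sneaker", "bag", "ankle boot",
]


def label_to_caption(label: int) -> str:
    """Turn a class ID into a fixed-template caption (hypothetical helper)."""
    name = CLASS_NAMES[label]
    # Pick "a" or "an" so the sentence reads naturally (an assumption, not
    # necessarily what the repository does).
    article = "an" if name[0] in "aeiou" else "a"
    return f"An image of {article} {name}"


captions = [label_to_caption(i) for i in range(len(CLASS_NAMES))]
print(captions[3])  # An image of a dress
print(captions[7])  # An image of a sneaker
```

Because every image of a class shares one caption, a batch can contain several images with identical text, which is one reason this setup behaves more like classification than open-vocabulary matching.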
At a high level, the repository contains:
- A ViT-style image encoder with patch embedding, a learnable class token, sinusoidal positional embeddings, and transformer encoder blocks
- A text encoder with a byte-level tokenizer, sinusoidal positional embeddings, transformer encoder blocks, and end-of-text pooling
- A CLIP-style contrastive objective with a learnable temperature parameter
- A training script that trains, saves the best checkpoint by training loss (the final batch loss of each epoch), reloads the checkpoint, and evaluates it on the test split
- A notebook that mirrors the tutorial flow and includes a small zero-shot classification demo
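
To make the symmetric contrastive objective in the list above concrete, here is a dependency-free sketch in plain Python. It is not the PyTorch code in model/model.py: the function names are hypothetical, and the temperature is a fixed constant here rather than the learnable parameter the repository uses.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Row k of image_emb and row k of text_emb are assumed to be a matching pair.
    The temperature value is a placeholder; the repository learns it.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text: for row k, the correct column is k.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: the same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(logits_t[k])[k] for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(symmetric_contrastive_loss(aligned, aligned))  # close to zero
print(symmetric_contrastive_loss(aligned, swapped))  # large
```

The loss is small when each image is most similar to its own caption and large when the pairs are mismatched, which is the signal that pulls matching embeddings together.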
This repo is:
- A compact reference implementation for learning how CLIP-style training works
- Focused on FashionMNIST and fixed prompt templates
- Easy to read end to end in a few files
This repo is not:
- The official OpenAI CLIP repository or a faithful reproduction of its training setup
- A large-scale open-vocabulary vision-language model
- Packaged as a reusable library or CLI
- `training.py`: training and test entry point
- `model/model.py`: positional embeddings, attention blocks, image encoder, text encoder, and CLIP loss
- `data/data_utils.py`: FashionMNIST dataset wrapper and tokenizer
- `notebooks/CLIPModel.ipynb`: tutorial-style notebook with saved outputs
- `docs/technical-overview.md`: deeper architecture and behavior notes
The project depends on PyTorch, TorchVision, Hugging Face Datasets, and NumPy.

    python3 -m venv .venv
    source .venv/bin/activate
    python3 -m pip install --upgrade pip

`requirements.txt` records plain version pins for the packages used by the project, so a standard install now works on typical PyTorch-supported environments:

    python3 -m pip install -r requirements.txt

If your platform still needs a custom PyTorch build, install the appropriate `torch` and `torchvision` pair first and then install the remaining packages.
The first run downloads the public fashion_mnist dataset through Hugging Face Datasets, so network access is required.
Run the full training-and-test flow with:

    python3 training.py

By default the script:
- Detects `cuda` if available, otherwise uses `cpu`
- Trains for 10 epochs on the FashionMNIST training split
- Saves the best checkpoint to `./clip.pt`
- Reloads that checkpoint
- Evaluates similarity against the 10 FashionMNIST class captions and prints accuracy
clip.pt is ignored by git, so local training artifacts do not pollute the repository.
The values below come from training.py:
| Parameter | Value |
|---|---|
| Embedding dimension | 32 |
| Vision width | 9 |
| Image size | (28, 28) |
| Patch size | (14, 14) |
| Image channels | 1 |
| Vision layers | 3 |
| Vision heads | 3 |
| Vocabulary size | 256 |
| Text width | 32 |
| Max sequence length | 32 |
| Text heads | 8 |
| Text layers | 4 |
| Learning rate | 1e-3 |
| Epochs | 10 |
| Batch size | 128 |
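
A few derived quantities follow directly from the table. The arithmetic below is a small sanity check rather than code from the repository; it assumes the usual ViT layout (non-overlapping patches plus one class token) and standard multi-head attention, where the width is split evenly across heads.

```python
image_size = (28, 28)
patch_size = (14, 14)
channels = 1
vision_width, vision_heads = 9, 3
text_width, text_heads = 32, 8

# The image must tile evenly into patches along both axes.
assert image_size[0] % patch_size[0] == 0
assert image_size[1] % patch_size[1] == 0
n_patches = (image_size[0] // patch_size[0]) * (image_size[1] // patch_size[1])

# One extra sequence position for the learnable class token.
vision_seq_len = n_patches + 1

# Pixel values covered by each patch before projection to vision_width.
patch_dim = patch_size[0] * patch_size[1] * channels

# Per-head dimension; width must be divisible by the number of heads.
assert vision_width % vision_heads == 0 and text_width % text_heads == 0
vision_head_dim = vision_width // vision_heads
text_head_dim = text_width // text_heads

print(n_patches, vision_seq_len, patch_dim, vision_head_dim, text_head_dim)
# 4 5 196 3 4
```

With only four patches per image and three dimensions per vision head, the image encoder is deliberately tiny, which keeps CPU training practical.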
The test() function does not evaluate against arbitrary natural-language prompts. Instead, it embeds the same 10 caption templates used during training, compares each test image to those caption embeddings, and measures whether the highest-similarity caption matches the expected label.
That means the printed accuracy behaves much more like a closed-set FashionMNIST classifier than a broad open-world benchmark. The notebook includes an additional "zero-shot" demo, but it still ranks predictions within the same FashionMNIST class set.
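
The closed-set evaluation described above amounts to an argmax over similarities to a fixed set of caption embeddings. The sketch below uses plain Python and toy 2-D vectors to show the idea; `predict` and `closed_set_accuracy` are hypothetical names, not functions from training.py.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict(image_emb, caption_embs):
    """Index of the caption embedding most similar to the image embedding."""
    scores = [cosine(image_emb, c) for c in caption_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def closed_set_accuracy(image_embs, labels, caption_embs):
    """Fraction of images whose best-matching caption index equals the label."""
    correct = sum(
        predict(img, caption_embs) == label
        for img, label in zip(image_embs, labels)
    )
    return correct / len(labels)


# Toy embeddings: three "class captions" and four "test images".
caption_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [-0.7, 0.3], [0.6, 0.5]]
labels = [0, 1, 2, 1]
print(closed_set_accuracy(image_embs, labels, caption_embs))  # 0.75
```

Since the candidate captions never change, the prediction can only ever be one of the known classes, which is why the metric reads as classifier accuracy.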
The saved notebook output shows a CPU run finishing with:
- Best reported batch loss of `2.642`
- `Model Accuracy: 84 %`
The companion Medium article summarizes the result as "around 85%". Treat those numbers as illustrative tutorial outputs, not as a fixed benchmark or reproducibility guarantee.
- The tokenizer is a very small byte-level scheme with `vocab_size = 256`, not the BPE tokenizer used by the original CLIP models.
- Captions are hand-authored templates derived directly from class IDs.
- Checkpoints are selected using the final batch loss of each epoch, not a validation metric.
- The Python script does not expose the notebook's zero-shot example as a command-line workflow.
- The repo currently has no automated test suite.
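
For readers unfamiliar with byte-level tokenization, the first limitation above can be illustrated with a short sketch. It is not the tokenizer in data/data_utils.py: the special byte values `SOT`, `EOT`, and `PAD`, the truncation rule, and the function names are all assumptions chosen for this example. Only the 256-entry vocabulary and the maximum sequence length of 32 come from the configuration table.

```python
SOT, EOT, PAD = 2, 3, 0  # hypothetical start, end, and padding byte values
MAX_SEQ_LEN = 32         # matches "Max sequence length" in the table


def encode(text, max_len=MAX_SEQ_LEN):
    """Return (token_ids, mask), each a list of length max_len."""
    ids = [SOT] + list(text.encode("utf-8")) + [EOT]
    if len(ids) > max_len:
        # Truncate, but keep the end-of-text marker in the final slot.
        ids = ids[: max_len - 1] + [EOT]
    # 1 marks real tokens, 0 marks padding.
    mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [PAD] * (max_len - len(ids))
    return ids, mask


def decode(ids, mask):
    """Recover the text from unmasked tokens, skipping the special markers."""
    raw = bytes(t for t, m in zip(ids, mask) if m and t not in (SOT, EOT))
    return raw.decode("utf-8", errors="replace")


ids, mask = encode("An image of a dress")
eot_position = sum(mask) - 1  # index a text encoder could pool from
print(len(ids), eot_position, decode(ids, mask))
```

Every token ID is a single byte value, so the vocabulary never exceeds 256 entries, at the cost of longer sequences than a BPE tokenizer would produce for the same text.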
- Colab notebook: the tutorial notebook with saved outputs, testing, and a zero-shot ranking example
- Medium article: the narrative walkthrough that explains the architecture and training flow in more detail
- Technical overview: repo-specific notes that map the article and notebook back to the actual source files