CLIP

An educational CLIP-style project built with PyTorch and trained on FashionMNIST. The repository pairs a small vision transformer with a transformer text encoder, projects both into a shared embedding space, and learns with a symmetric contrastive loss.

This is a learning-oriented implementation, not an exact reproduction of OpenAI CLIP. The codebase is intentionally small, the dataset is tiny compared with real CLIP training corpora, and most of the project is designed to support the companion notebook and tutorial article.

For a source-level walkthrough, see docs/technical-overview.md.

Overview

The project trains on image and caption pairs generated from FashionMNIST labels. Each class is converted into a caption such as "An image of a dress" or "An image of a sneaker", then the model learns to match the correct image and text pair within each batch.
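The label-to-caption step can be sketched as below. This is a minimal illustration, not the code in data/data_utils.py: the names CLASS_PHRASES and caption_for are hypothetical, and only the "dress" and "sneaker" captions are quoted in this README. The remaining phrases are assumed to follow the same template, using the standard FashionMNIST label order.

```python
# Hypothetical mapping from FashionMNIST label IDs to noun phrases.
# Only the captions for labels 3 and 7 are quoted in the README;
# the other phrases are assumptions that follow the same template.
CLASS_PHRASES = {
    0: "a t-shirt/top",
    1: "trousers",
    2: "a pullover",
    3: "a dress",
    4: "a coat",
    5: "a sandal",
    6: "a shirt",
    7: "a sneaker",
    8: "a bag",
    9: "an ankle boot",
}


def caption_for(label: int) -> str:
    """Turn a class ID into the fixed caption used as the text side of a pair."""
    return f"An image of {CLASS_PHRASES[label]}"


# One caption per class, in label order.
captions = [caption_for(label) for label in sorted(CLASS_PHRASES)]
print(captions[3])
print(captions[7])
```

Because every image of a class shares one caption, a batch will usually contain several images with identical text, which is one reason this setup behaves differently from training on free-form captions.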

At a high level, the repository contains:

A ViT-style image encoder with patch embedding, a learnable class token, sinusoidal positional embeddings, and transformer encoder blocks
A text encoder with a byte-level tokenizer, sinusoidal positional embeddings, transformer encoder blocks, and end-of-text pooling
A CLIP-style contrastive objective with a learnable temperature parameter
A training script that trains, saves the checkpoint with the lowest training loss (measured on the final batch of each epoch), reloads that checkpoint, and evaluates it on the test split
A notebook that mirrors the tutorial flow and includes a small zero-shot classification demo
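To make the contrastive objective concrete, here is a dependency-free sketch of a symmetric CLIP-style loss. It is not the implementation in model/model.py, which uses PyTorch: the function names are hypothetical, and the temperature is passed as a plain float rather than learned. How the repository parameterizes its learnable temperature is not shown here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Cross-entropy for one row of logits, using a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def clip_loss(image_embs, text_embs, temperature):
    """Symmetric contrastive loss: pair i in the batch is the positive for row/column i."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] is the cosine similarity of image i and text j, divided by temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text: each row should pick out its own column.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: each column should pick out its own row.
    loss_t2i = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2


aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], temperature=0.07)
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]], temperature=0.07)
print(aligned < swapped)
```

A smaller temperature sharpens the softmax, so well-aligned pairs drive the loss closer to zero while mismatched pairs are penalized more heavily.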

What This Repo Is And Is Not

This repo is:

A compact reference implementation for learning how CLIP-style training works
Focused on FashionMNIST and fixed prompt templates
Easy to read end to end in a few files

This repo is not:

The official OpenAI CLIP repository or a faithful reproduction of its training setup
A large-scale open-vocabulary vision-language model
Packaged as a reusable library or CLI

Repository Layout

training.py: training and test entry point
model/model.py: positional embeddings, attention blocks, image encoder, text encoder, and CLIP loss
data/data_utils.py: FashionMNIST dataset wrapper and tokenizer
notebooks/CLIPModel.ipynb: tutorial-style notebook with saved outputs
docs/technical-overview.md: deeper architecture and behavior notes

Setup

The project depends on PyTorch, TorchVision, Hugging Face Datasets, and NumPy.

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip

requirements.txt pins plain versions of the packages used by the project, so a standard install works on typical PyTorch-supported environments:

python3 -m pip install -r requirements.txt

If your platform still needs a custom PyTorch build, install the appropriate torch and torchvision pair first and then install the remaining packages.

The first run downloads the public fashion_mnist dataset through Hugging Face Datasets, so network access is required.

Quickstart

Run the full training-and-test flow with:

python3 training.py

By default the script:

Detects cuda if available, otherwise uses cpu
Trains for 10 epochs on the FashionMNIST training split
Saves the best checkpoint to ./clip.pt
Reloads that checkpoint
Compares each test image against the 10 FashionMNIST class captions and prints the accuracy

clip.pt is ignored by git, so local training artifacts do not pollute the repository.

Default Configuration

The values below come from training.py:

Parameter	Value
Embedding dimension	`32`
Vision width	`9`
Image size	`(28, 28)`
Patch size	`(14, 14)`
Image channels	`1`
Vision layers	`3`
Vision heads	`3`
Vocabulary size	`256`
Text width	`32`
Max sequence length	`32`
Text heads	`8`
Text layers	`4`
Learning rate	`1e-3`
Epochs	`10`
Batch size	`128`
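A few derived quantities follow from the table and help explain how small the model is. The sketch below is plain arithmetic on the values above. The per-head dimensions assume that each encoder splits its width evenly across heads, as standard multi-head attention does; the variable and function names are illustrative and are not taken from training.py.

```python
# Values copied from the configuration table.
image_size = (28, 28)
patch_size = (14, 14)
vision_width, vision_heads = 9, 3
text_width, text_heads = 32, 8


def num_patches(image_size, patch_size):
    """Number of non-overlapping patches; the image must divide evenly."""
    (height, width), (patch_h, patch_w) = image_size, patch_size
    assert height % patch_h == 0 and width % patch_w == 0
    return (height // patch_h) * (width // patch_w)


patches = num_patches(image_size, patch_size)    # 2 x 2 grid = 4 patches
vision_seq_len = patches + 1                     # plus the class token = 5
vision_head_dim = vision_width // vision_heads   # 9 / 3 = 3 (assumes an even split)
text_head_dim = text_width // text_heads         # 32 / 8 = 4 (assumes an even split)

print(patches, vision_seq_len, vision_head_dim, text_head_dim)  # 4 5 3 4
```

With only four patches per image, the vision transformer attends over a sequence of five tokens, which keeps training fast enough for a CPU-only run.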

How Evaluation Works

The test() function does not evaluate against arbitrary natural-language prompts. Instead, it embeds the same 10 caption templates used during training, compares each test image to those caption embeddings, and measures whether the highest-similarity caption matches the expected label.

That means the printed accuracy is effectively closed-set, 10-way FashionMNIST classification accuracy rather than an open-world benchmark. The notebook includes an additional "zero-shot" demo, but it still ranks predictions within the same FashionMNIST class set.
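The evaluation logic described above reduces to a nearest-caption lookup. The following sketch shows that idea on toy vectors, using plain Python lists in place of the model's embeddings. The function names are hypothetical, and the real test() function in training.py works on PyTorch tensors and may differ in detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def predict(image_emb, caption_embs):
    """Return the index of the caption embedding most similar to the image."""
    sims = [cosine(image_emb, caption) for caption in caption_embs]
    return max(range(len(sims)), key=sims.__getitem__)


def accuracy(image_embs, labels, caption_embs):
    """Fraction of images whose best-matching caption index equals the label."""
    correct = sum(predict(emb, caption_embs) == label for emb, label in zip(image_embs, labels))
    return correct / len(labels)


# Toy stand-ins: three "class caption" embeddings and three "image" embeddings.
caption_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.6, 0.3, 0.1]]
labels = [0, 1, 2]

print(accuracy(image_embs, labels, caption_embs))
```

Since the candidate set is fixed to the class captions, the prediction is an argmax over a known label set, which is why the result reads as classification accuracy.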

Reported Results In The Linked Resources

The saved notebook output shows a CPU run finishing with:

Best reported batch loss of 2.642
Model Accuracy: 84 %

The companion Medium article summarizes the result as "around 85%". Treat those numbers as illustrative tutorial outputs, not as a fixed benchmark or reproducibility guarantee.

Limitations

The tokenizer is a very small byte-level scheme with vocab_size = 256, not the BPE tokenizer used by the original CLIP models.
Captions are hand-authored templates derived directly from class IDs.
Checkpoints are selected using the final batch loss of each epoch, not a validation metric.
The Python script does not expose the notebook's zero-shot example as a command-line workflow.
The repo currently has no automated test suite.
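To illustrate the tokenizer limitation, here is a sketch of what a byte-level scheme with vocab_size = 256 and a maximum sequence length of 32 can look like. It is an assumption-laden illustration, not the tokenizer in data/data_utils.py: the special byte values SOT, EOT, and PAD and the function names are hypothetical choices made for this example.

```python
MAX_SEQ_LEN = 32
# Hypothetical special tokens, chosen from rarely used control bytes.
SOT, EOT, PAD = 2, 3, 0


def tokenize(text, max_seq_len=MAX_SEQ_LEN):
    """Encode text as UTF-8 bytes, wrap it in start/end markers, and pad to a fixed length."""
    body = list(text.encode("utf-8"))[: max_seq_len - 2]  # leave room for SOT and EOT
    ids = [SOT] + body + [EOT]
    mask = [1] * len(ids) + [0] * (max_seq_len - len(ids))  # 1 marks real tokens
    ids = ids + [PAD] * (max_seq_len - len(ids))
    return ids, mask


def eot_index(mask):
    """Position of the end-of-text token, where end-of-text pooling would read features."""
    return sum(mask) - 1


ids, mask = tokenize("An image of a dress")
print(len(ids), eot_index(mask))
```

Every token ID fits in a single byte, so the embedding table stays tiny, but the model sees characters rather than subword units, unlike the BPE tokenizer used by the original CLIP models.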

Additional Resources

Colab notebook: the tutorial notebook with saved outputs, testing, and a zero-shot ranking example
Medium article: the narrative walkthrough that explains the architecture and training flow in more detail
Technical overview: repo-specific notes that map the article and notebook back to the actual source files
