mbn312/CLIP

CLIP

An educational CLIP-style project built with PyTorch and trained on FashionMNIST. The repository pairs a small vision transformer with a transformer text encoder, projects both into a shared embedding space, and learns with a symmetric contrastive loss.

This is a learning-oriented implementation, not an exact reproduction of OpenAI CLIP. The codebase is intentionally small, the dataset is tiny compared with real CLIP training corpora, and most of the project is designed to support the companion notebook and tutorial article.

For a source-level walkthrough, see docs/technical-overview.md.

Overview

The project trains on image and caption pairs generated from FashionMNIST labels. Each class is converted into a caption such as "An image of a dress" or "An image of a sneaker", then the model learns to match the correct image and text pair within each batch.
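The label-to-caption step can be sketched as follows. Only the "dress" and "sneaker" captions are quoted in this README; the remaining phrases and the names `CLASS_PHRASES` and `caption_for` are assumptions based on the standard FashionMNIST class order, not code copied from the repository.

```python
# Hypothetical noun phrases in standard FashionMNIST class order (IDs 0-9).
# Only "a dress" (3) and "a sneaker" (7) are confirmed by the README text.
CLASS_PHRASES = [
    "a t-shirt/top", "trousers", "a pullover", "a dress", "a coat",
    "a sandal", "a shirt", "a sneaker", "a bag", "an ankle boot",
]


def caption_for(label: int) -> str:
    """Turn a FashionMNIST class ID into a fixed caption template."""
    return f"An image of {CLASS_PHRASES[label]}"


# One caption per class ID; every image with the same label shares a caption.
captions = {label: caption_for(label) for label in range(len(CLASS_PHRASES))}

print(captions[3])  # An image of a dress
print(captions[7])  # An image of a sneaker
```

Because each caption is a pure function of the class ID, a batch that contains two images of the same class also contains two identical captions, which is one reason the task behaves like closed-set classification.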

At a high level, the repository contains:

  • A ViT-style image encoder with patch embedding, a learnable class token, sinusoidal positional embeddings, and transformer encoder blocks
  • A text encoder with a byte-level tokenizer, sinusoidal positional embeddings, transformer encoder blocks, and end-of-text pooling
  • A CLIP-style contrastive objective with a learnable temperature parameter
  • A training script that trains, saves the best checkpoint by training loss, reloads the checkpoint, and evaluates it on the test split
  • A notebook that mirrors the tutorial flow and includes a small zero-shot classification demo
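As a rough illustration of the contrastive objective listed above, the sketch below computes a symmetric cross-entropy loss over a batch of image and text embeddings. The class name `ContrastiveHead`, the log-scale parameterization, and the initial temperature of 0.07 (the value described in the CLIP paper) are assumptions for illustration; the repository's own module may be organized differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveHead(nn.Module):
    """Symmetric CLIP-style loss with a learnable temperature (sketch)."""

    def __init__(self, init_temperature: float = 0.07):
        super().__init__()
        # Store log(1 / temperature) so the scale stays positive during training.
        # The 0.07 default is an assumption, not a value taken from this repo.
        self.log_scale = nn.Parameter(torch.tensor(1.0 / init_temperature).log())

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Cosine similarity: L2-normalize, then take pairwise dot products.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = self.log_scale.exp() * image_emb @ text_emb.t()  # (B, B)

        # The matching pair for row i is column i.
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_images = F.cross_entropy(logits, targets)      # image -> text
        loss_texts = F.cross_entropy(logits.t(), targets)   # text -> image
        return (loss_images + loss_texts) / 2


# Toy usage with random embeddings of dimension 32.
torch.manual_seed(0)
head = ContrastiveHead()
images, texts = torch.randn(8, 32), torch.randn(8, 32)
loss = head(images, texts)
loss.backward()  # the temperature parameter receives a gradient too
```

Averaging the two directions is what makes the loss symmetric: each image must pick out its caption, and each caption must pick out its image.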

What This Repo Is And Is Not

This repo is:

  • A compact reference implementation for learning how CLIP-style training works
  • Focused on FashionMNIST and fixed prompt templates
  • Easy to read end to end in a few files

This repo is not:

  • The official OpenAI CLIP repository or a faithful reproduction of its training setup
  • A large-scale open-vocabulary vision-language model
  • Packaged as a reusable library or CLI

Repository Layout

Setup

The project depends on PyTorch, TorchVision, Hugging Face Datasets, and NumPy.

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip

requirements.txt pins plain package versions, so a standard install works in typical PyTorch-supported environments:

python3 -m pip install -r requirements.txt

If your platform needs a custom PyTorch build, install the matching torch and torchvision pair first, then install the remaining packages.

The first run downloads the public fashion_mnist dataset through Hugging Face Datasets, so network access is required.

Quickstart

Run the full training-and-test flow with:

python3 training.py

By default the script:

  1. Detects cuda if available, otherwise uses cpu
  2. Trains for 10 epochs on the FashionMNIST training split
  3. Saves the best checkpoint to ./clip.pt
  4. Reloads that checkpoint
  5. Evaluates similarity against the 10 FashionMNIST class captions and prints accuracy

clip.pt is ignored by git, so local training artifacts do not pollute the repository.
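The device detection, checkpointing, and reload steps can be sketched as below. This is a minimal stand-in, not the contents of training.py: the `nn.Linear` model, the random batches, the Adam optimizer, and the temporary checkpoint path are all assumptions so the example runs on its own. The selection rule follows the Limitations section, which says the final batch loss of each epoch decides which checkpoint is kept.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Step 1: use CUDA when available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in model and data (assumptions); the real script trains the CLIP model.
torch.manual_seed(0)
model = nn.Linear(4, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]

# The script writes ./clip.pt; a temp directory keeps this sketch side-effect free.
ckpt_path = os.path.join(tempfile.mkdtemp(), "clip.pt")
best_loss = float("inf")

for epoch in range(2):
    for x, y in batches:
        x, y = x.to(device), y.to(device)
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Steps 2-3: `loss` now holds the epoch's final batch loss; keep the best one.
    if loss.item() <= best_loss:
        best_loss = loss.item()
        torch.save(model.state_dict(), ckpt_path)

# Step 4: reload the best weights before evaluation.
model.load_state_dict(torch.load(ckpt_path, map_location=device))
```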

Default Configuration

The values below come from training.py:

| Parameter | Value |
| --- | --- |
| Embedding dimension | 32 |
| Vision width | 9 |
| Image size | (28, 28) |
| Patch size | (14, 14) |
| Image channels | 1 |
| Vision layers | 3 |
| Vision heads | 3 |
| Vocabulary size | 256 |
| Text width | 32 |
| Max sequence length | 32 |
| Text heads | 8 |
| Text layers | 4 |
| Learning rate | 1e-3 |
| Epochs | 10 |
| Batch size | 128 |
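A few quantities follow directly from these defaults. The arithmetic below assumes non-overlapping patches and standard multi-head attention, where the width is split evenly across heads; the variable names are illustrative and not taken from training.py.

```python
img_size = (28, 28)
patch_size = (14, 14)
n_channels = 1
vision_width, vision_heads = 9, 3
text_width, text_heads = 32, 8

# Non-overlapping patches per image, plus one learnable class token.
n_patches = (img_size[0] // patch_size[0]) * (img_size[1] // patch_size[1])
vision_seq_len = n_patches + 1

# Raw pixel values in one flattened patch.
patch_values = patch_size[0] * patch_size[1] * n_channels

# Each attention head gets an equal slice of the model width.
assert vision_width % vision_heads == 0
assert text_width % text_heads == 0
vision_head_dim = vision_width // vision_heads
text_head_dim = text_width // text_heads

print(n_patches, vision_seq_len, patch_values)  # 4 5 196
print(vision_head_dim, text_head_dim)           # 3 4
```

With only four patches per image and a head dimension of 3, the vision encoder is deliberately tiny, which is consistent with the tutorial scale of the project.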

How Evaluation Works

The test() function does not evaluate against arbitrary natural-language prompts. Instead, it embeds the same 10 caption templates used during training, compares each test image to those caption embeddings, and measures whether the highest-similarity caption matches the expected label.

That means the printed accuracy behaves much more like a closed-set FashionMNIST classifier than a broad open-world benchmark. The notebook includes an additional "zero-shot" demo, but it still ranks predictions within the same FashionMNIST class set.
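The closed-set comparison described above amounts to an argmax over image-to-caption similarities. The sketch below uses random tensors in place of real encoder outputs, and the function name `classify` is hypothetical; it shows the ranking logic only, not the repository's test() function.

```python
import torch
import torch.nn.functional as F


def classify(image_emb: torch.Tensor, caption_emb: torch.Tensor) -> torch.Tensor:
    """Return, for each image, the index of the most similar caption."""
    image_emb = F.normalize(image_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    similarity = image_emb @ caption_emb.t()  # (n_images, n_captions)
    return similarity.argmax(dim=-1)


# Toy embeddings standing in for encoder outputs (assumption).
torch.manual_seed(0)
caption_emb = torch.randn(10, 32)        # one row per class caption
labels = torch.tensor([3, 7, 0, 9])      # expected class IDs
image_emb = caption_emb[labels] + 0.05 * torch.randn(4, 32)

pred = classify(image_emb, caption_emb)
accuracy = (pred == labels).float().mean().item()
```

Since the candidate set is always the same 10 captions, a prediction can never fall outside the FashionMNIST classes, whatever the image contains.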

Reported Results In The Linked Resources

The saved notebook output shows a CPU run finishing with:

  • Best reported batch loss of 2.642
  • Model Accuracy: 84 %

The companion Medium article summarizes the result as "around 85%". Treat those numbers as illustrative tutorial outputs, not as a fixed benchmark or reproducibility guarantee.

Limitations

  • The tokenizer is a very small byte-level scheme with vocab_size = 256, not the BPE tokenizer used by the original CLIP models.
  • Captions are hand-authored templates derived directly from class IDs.
  • Checkpoints are selected using the final batch loss of each epoch, not a validation metric.
  • The Python script does not expose the notebook's zero-shot example as a command-line workflow.
  • The repo currently has no automated test suite.
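To make the first limitation concrete, a byte-level tokenizer of this kind can be sketched in a few lines. The special-token values `SOT`, `EOT`, and `PAD`, the function name `tokenize`, and the mask layout are assumptions for illustration; only the vocabulary size of 256 and the maximum sequence length of 32 come from this README.

```python
MAX_SEQ_LENGTH = 32       # from Default Configuration
SOT, EOT, PAD = 2, 3, 0   # hypothetical special-token byte values


def tokenize(text: str, max_seq_length: int = MAX_SEQ_LENGTH):
    """Encode text as UTF-8 bytes wrapped in start/end markers, then pad."""
    ids = [SOT] + list(text.encode("utf-8")) + [EOT]
    if len(ids) > max_seq_length:
        raise ValueError("caption too long for max_seq_length")
    n_pad = max_seq_length - len(ids)
    mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return ids + [PAD] * n_pad, mask


ids, mask = tokenize("An image of a dress")

# End-of-text pooling reads the encoder output at this position.
eot_index = ids.index(EOT)
print(len(ids), eot_index, sum(mask))  # 32 20 21
```

Every token ID is a single byte, so the vocabulary never exceeds 256 entries, but a caption costs one token per byte rather than one per subword as with BPE.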

Additional Resources

  • Colab notebook: the tutorial notebook with saved outputs, testing, and a zero-shot ranking example
  • Medium article: the narrative walkthrough that explains the architecture and training flow in more detail
  • Technical overview: repo-specific notes that map the article and notebook back to the actual source files
