Vision Drop and Automatic Vision Model Selection for Image Classification Tasks

Alessandro Viespoli

Based on the original work by Shwai He*, Guoheng Sun*, Zheyu Shen, Ang Li — University of Maryland, College Park

⚙️ Installation • 📦 Layout • 🧰 Models • 🚀 Dropping • 📊 Benchmark • 🤖 Model Selection • 📄 Citation

This repository extends the official implementation of Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping (TMLR 2026) with full support for vision transformers, multi-dataset benchmarking, and an automated model selection pipeline.

📖 Introduction

This project studies architectural redundancy in Transformer models — both LLMs and vision transformers — and provides practical pipelines for:

Block Drop — remove full Transformer blocks (attention + MLP together)
Layer Drop — drop attention or MLP sublayers independently
Joint Layer Drop — drop across both sublayer types simultaneously
Vision Transformer Support — DINOv2, DINOv3 ViT, SwinV2, ViT
Automated Model Selection — grid search over architectures, methods, and drop counts
Benchmarking — task accuracy, inference speed, FLOPs, and the SDR efficiency metric

The dropping pipeline is built on LLaMA-Factory.

⚙️ Installation

conda create -n llm-drop python=3.10 -y
conda activate llm-drop

git clone https://github.com/zincalex/LLM-Vision-Drop.git
cd LLM-Vision-Drop

# Core dropping pipeline
pip install -r requirements.txt
pip install -e .

Optional: Quantization dependencies (AWQ / GPTQ)

cd src/llmtuner/compression/quantization/AutoAWQ
pip install -e .

cd AutoAWQ_kernels
pip install -e .

cd ../../AutoGPTQ
pip install -vvv --no-build-isolation -e .

# Return to the repository root (AutoGPTQ sits five levels below it)
cd ../../../../..

📦 Repository Layout

src/
├── compress.py                                # Entry point for dropping/compression
├── benchmark_speed.py                         # LLM inference speed benchmark
├── benchmark_vision_speed.py                  # Vision model speed + FLOPs benchmark
├── llmtuner/
│   └── compression/prune/                     # Core dropping algorithms (block, layer, joint)
│       └── models/                            # Custom dropped-model classes per architecture
├── vm-eval/
│   └── benchmark_vision.py                    # Vision evaluation harness (finetune head + test)
├── model-selection/
│   └── pipeline.py                            # Automated model selection pipeline
├── model-healing/
│   └── heal_model_vm.py                       # LoRA-based recovery after layer dropping
└── visualization/
    ├── compute_sdr.py                         # SDR metric computation
    └── visualize_benchmark_results.py

scripts/
├── dropping/                                  # Shell scripts for block/layer drop (LLM + vision)
├── benchmark/                                 # Evaluation and speed benchmark wrappers
├── model-selection/                           # Model selection runner
├── healing/                                   # Model healing runner
└── visualization/                             # SDR and result plotting

🧰 Prepare Models

Models are downloaded automatically from Hugging Face on first run via from_pretrained. For gated models (e.g. Llama-2), authenticate first:

huggingface-cli login

Supported architectures

Domain	Architecture	HuggingFace ID
Vision	DINOv2	`facebook/dinov2-giant-imagenet1k-1-layer`
Vision	DINOv3 ViT	`facebook/dinov3-vitl16-pretrain-lvd1689m`
Vision	SwinV2	`microsoft/swinv2-base-patch4-window16-256`
Vision	ViT	`google/vit-base-patch16-224`
LLM	Mistral-7B	`mistralai/Mistral-7B-v0.1`
LLM	Llama-2-13B	`meta-llama/Llama-2-13b-hf`

Dropped model config

After running the dropping pipeline, drop_attn_list and drop_mlp_list are written into the model's config.json. Example configurations:

// Drop attention layers only
{ "drop_attn_list": [25, 26, 24, 22], "drop_mlp_list": [] }

// Drop MLP layers only
{ "drop_attn_list": [], "drop_mlp_list": [26, 27, 25, 24] }

// Drop full blocks
{ "drop_attn_list": [26, 25, 24, 27], "drop_mlp_list": [26, 25, 24, 27] }

Custom model classes are stored under src/llmtuner/compression/prune/models/ and referenced via auto_map in the config.
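
As a quick sanity check on a dropped checkpoint, the sketch below classifies layer indices by which sublayers the config marks as dropped. Only `drop_attn_list` and `drop_mlp_list` come from this repository; the helper name `summarize_drops` is hypothetical and is not part of the codebase.

```python
import json


def summarize_drops(config: dict) -> dict:
    """Group layer indices by which sublayers the config marks as dropped."""
    drop_attn = set(config.get("drop_attn_list", []))
    drop_mlp = set(config.get("drop_mlp_list", []))
    return {
        "full_blocks": sorted(drop_attn & drop_mlp),  # attention and MLP both removed
        "attn_only": sorted(drop_attn - drop_mlp),    # attention removed, MLP kept
        "mlp_only": sorted(drop_mlp - drop_attn),     # MLP removed, attention kept
    }


# Mirrors the "Drop full blocks" example configuration above.
example = json.loads(
    '{"drop_attn_list": [26, 25, 24, 27], "drop_mlp_list": [26, 25, 24, 27]}'
)
summary = summarize_drops(example)
print(summary)
```

For a real checkpoint, load the dictionary from the model directory's config.json with `json.load` instead of the inline string.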

🚀 Run Dropping

# Vision models
bash scripts/dropping/vision_block_drop.sh
bash scripts/dropping/vision_layer_drop.sh
bash scripts/dropping/vision_layer_drop_joint.sh

# LLMs
bash scripts/dropping/block_drop.sh
bash scripts/dropping/layer_drop.sh
bash scripts/dropping/layer_drop_joint.sh

Each script runs in two phases:

Similarity estimation — computes cosine similarity between layer inputs and outputs on a calibration set, identifies which layers to drop, and saves the config.
Post-dropping — applies the dropped config to the model checkpoint.

Similarity results are cached as .pt files under results_prune/cache/ so re-running with the same settings skips recomputation.
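
The similarity-estimation phase can be illustrated with a small pure-Python sketch on toy vectors. It is not the repository's implementation: the function names are hypothetical, the real pipeline works on hidden-state tensors from a calibration set, and the ranking rule (drop the layers whose output is most similar to their input) is stated here as an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_layers_to_drop(layer_io, drop_num):
    """Rank layers by mean input/output similarity and return the top `drop_num`.

    `layer_io` maps a layer index to a list of (input, output) vector pairs,
    one pair per calibration sample. Assumption: a higher similarity means the
    layer changes its input less, so it is a better candidate for dropping.
    """
    scores = {
        idx: sum(cosine_similarity(x, y) for x, y in pairs) / len(pairs)
        for idx, pairs in layer_io.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:drop_num], scores


# Toy data: layer 1 barely changes its input, layer 0 rotates it by 90 degrees.
toy = {
    0: [([1.0, 0.0], [0.0, 1.0])],
    1: [([1.0, 0.0], [1.0, 0.01])],
    2: [([1.0, 0.0], [1.0, 1.0])],
}
dropped, scores = select_layers_to_drop(toy, drop_num=1)
print(dropped)
```

In this toy case layer 1 is selected, because its output is almost parallel to its input.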

📊 Benchmark

🖼️ Vision model evaluation

Evaluates a dropped vision model on a dataset: optionally fine-tunes the classification head, runs inference on the test split, and saves logits and predictions to HDF5. It also prints a per-layer execution check that confirms which attention and MLP sublayers were actually skipped.

bash scripts/benchmark/benchmark_vm_eval.sh

Key arguments (edit the script or call directly):

CUDA_VISIBLE_DEVICES=0,1 accelerate launch src/vm-eval/benchmark_vision.py \
  --model_name_or_path ./dinov2_model \
  --dataset LCZ42 \
  --dataset_base_dir data \
  --prune_method layer_drop_attn \
  --drop_num 4 \
  --finetune_head \
  --epochs 20 \
  --lr 0.001 \
  --weight_decay 0.03 \
  --batch_size 32 \
  --batch_size_eval 10 \
  --output_file results/dinov2_lcz42_drop4.out

⚡ Inference speed + FLOPs

Measures throughput (images/s), latency, memory, and FLOPs for vision models. FLOPs computation respects the dropped layers in the config.

bash scripts/benchmark/benchmark_vm_speed.sh
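
To see why a dropped config changes the FLOPs figure, here is a back-of-the-envelope estimate for a plain ViT encoder. It is a sketch under stated assumptions, not the logic of benchmark_vision_speed.py: it counts only the matrix multiplications in the attention and MLP sublayers, it ignores patch embedding, normalization, softmax, and the classification head, and it does not model windowed attention as used by SwinV2. The function name and the `mlp_ratio` default are illustrative.

```python
def vit_encoder_macs(num_layers, num_tokens, dim, drop_attn=(), drop_mlp=(), mlp_ratio=4):
    """Approximate multiply-accumulate count for the encoder blocks of a plain ViT."""
    # Attention: Q/K/V and output projections (4*N*D^2) plus the score and
    # weighted-sum matmuls (2*N^2*D).
    attn = 4 * num_tokens * dim * dim + 2 * num_tokens * num_tokens * dim
    # MLP: two linear layers, D -> mlp_ratio*D -> D.
    mlp = 2 * mlp_ratio * num_tokens * dim * dim
    # Dropped sublayers contribute nothing; indices are assumed to be valid.
    kept_attn = num_layers - len(set(drop_attn))
    kept_mlp = num_layers - len(set(drop_mlp))
    return kept_attn * attn + kept_mlp * mlp


# ViT-Base-like shape: 12 layers, 197 tokens (196 patches + CLS), width 768.
full = vit_encoder_macs(12, 197, 768)
dropped = vit_encoder_macs(12, 197, 768, drop_attn=[8, 9, 10, 11])
print(f"full: {full / 1e9:.2f} GMACs, 4 attention sublayers dropped: {dropped / 1e9:.2f} GMACs")
```

FLOPs are commonly reported as roughly twice the MAC count, so compare like with like when reading benchmark output.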

📈 SDR metric

The Speedup Degradation Ratio (γ = ΔAccuracy / ΔSpeedup) measures the accuracy lost per unit of throughput gained. A lower γ indicates more efficient compression.

bash scripts/visualization/compute_sdr_all.sh

Results are written to src/visualization/sdr_results/.
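
The ratio can be worked through by hand with a few lines of Python. The exact definitions used by compute_sdr.py are not spelled out here, so this sketch assumes that ΔAccuracy is the accuracy lost in percentage points and ΔSpeedup is the throughput gained in percent relative to the undropped model. The function name and the numbers are illustrative only.

```python
def sdr(base_acc, dropped_acc, base_throughput, dropped_throughput):
    """Speedup Degradation Ratio under the assumptions stated above.

    Accuracies are in percent; throughputs are in images/s.
    """
    delta_acc = base_acc - dropped_acc  # percentage points lost
    delta_speedup = (dropped_throughput / base_throughput - 1.0) * 100  # percent gained
    if delta_speedup <= 0:
        raise ValueError("the dropped model is not faster than the baseline")
    return delta_acc / delta_speedup


# Hypothetical numbers for illustration, not measured results.
gamma = sdr(base_acc=80.0, dropped_acc=78.0, base_throughput=100.0, dropped_throughput=125.0)
print(gamma)
```

Here two points of accuracy are traded for a 25% throughput gain, giving γ = 2 / 25 = 0.08.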

🧪 LLM task performance (lm-eval)

bash scripts/benchmark/benchmark_lm_eval.sh

This benchmark depends on EleutherAI/lm-evaluation-harness.
For strict reproduction, the repo uses this fork: s1ghhh/lm-evaluation-harness.
Use modeling files in src/llmtuner/model when loading Mistral/Llama with dropped configs.

⚡ LLM inference speed

bash scripts/benchmark/benchmark_speed.sh

Edit model_path, save_file, and model_type in the script before running.

Quantization benchmarks (AWQ / GPTQ)

bash scripts/quantization/awq.sh
bash scripts/quantization/gptq.sh

Edit model_path and quant_path in those scripts and ensure CUDA-compatible package versions are installed (see Installation).

🤖 Automated Model Selection

The model selection pipeline automates the full search over all combinations of architecture, pruning method, and drop count for a given dataset. It runs in three phases:

Baseline — fine-tunes the classification head at drop=0 to establish a reference accuracy.
Search — quick fine-tuning (few epochs) across all (arch × method × drop_n) variants. Early stopping halts a search direction if accuracy drops more than a configurable threshold below baseline.
Deep fine-tune — full fine-tuning of the winning variant, followed by final test-set evaluation.

Results are logged to results_selection/ with per-variant accuracy, the selected winner, and the final test metrics.
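
The search phase with early stopping can be sketched as a nested loop. This is a minimal illustration rather than the code in pipeline.py: `run_search`, `evaluate`, and `max_drop` are hypothetical names, the baseline is assumed to be tracked per architecture, and the threshold is assumed to be an absolute drop in accuracy expressed as a fraction.

```python
from itertools import product


def run_search(evaluate, architectures, methods, max_drop, drop_step, baseline_acc, threshold):
    """Evaluate (arch, method, drop_n) variants, stopping each direction early.

    `evaluate(arch, method, drop_n)` is a caller-supplied function returning
    accuracy as a fraction in [0, 1]; `baseline_acc` maps arch -> drop=0 accuracy.
    """
    results = {}
    for arch, method in product(architectures, methods):
        for drop_n in range(drop_step, max_drop + 1, drop_step):
            acc = evaluate(arch, method, drop_n)
            results[(arch, method, drop_n)] = acc
            # Early stopping: abandon this (arch, method) direction once
            # accuracy falls more than `threshold` below the baseline.
            if acc < baseline_acc[arch] - threshold:
                break
    return results


# Toy evaluator: accuracy decays linearly with the number of dropped layers.
def fake_evaluate(arch, method, drop_n):
    return 0.90 - 0.01 * drop_n


results = run_search(fake_evaluate, ["vit"], ["block_drop"], max_drop=12,
                     drop_step=4, baseline_acc={"vit": 0.90}, threshold=0.05)
print(sorted(results))
```

With the toy evaluator, drop counts 4 and 8 are evaluated and 12 is skipped, because accuracy at 8 has already fallen more than 0.05 below the baseline. How the winner is chosen among the surviving variants is left out of this sketch.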

bash scripts/model-selection/run_selection.sh

Key parameters (edit run_selection.sh):

dataset="LCZ42"                              # Target dataset
architectures="dinov2 dinov3_vit swinv2 vit" # Architectures to search
prune_methods="block_drop layer_drop_attn layer_drop_mlp layer_drop_all"
drop_step=4              # Evaluate drop counts: 4, 8, 12, ...
early_stop_threshold=0.05 # Stop if accuracy drops >5% below baseline
baseline_epochs=5
search_epochs=5
deep_epochs=20

Supported datasets

Dataset	Domain	Classes	Notes
`imagenet-1k`	Natural images	1000	Standard benchmark; head fine-tuning skipped if classes match
`cifar10`	Natural images	10
`LCZ42`	Remote sensing	17	Urban morphology classification
`CrossD`	Cross-domain	varies	Multi-domain classification
`zoolake`	Microscopy	varies	Zooplankton identification
`lar`	Medical	varies	Laryngeal endoscopy
`InfLarynge`	Medical	varies	Inflamed laryngeal tissue
`DAPlankton`	Microscopy	varies	Plankton imaging
`Bark`	Texture	varies	Tree bark classification
`Pest`	Agriculture	varies	Crop pest identification

All datasets are stored as stratified HDF5 splits (train.h5, val.h5, test.h5) under data/<dataset>/. Preprocessing scripts are in data/.

📬 Contributor

Alessandro Viespoli (alessandro.viespoli@studenti.unipd.it)
