Alessandro Viespoli
Based on the original work by Shwai He*, Guoheng Sun*, Zheyu Shen, Ang Li — University of Maryland, College Park
⚙️ Installation • 📦 Layout • 🧰 Models • 🚀 Dropping • 📊 Benchmark • 🤖 Model Selection • 📄 Citation
This repository extends the official implementation of Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping (TMLR 2026) with full support for vision transformers, multi-dataset benchmarking, and an automated model selection pipeline.
This project studies architectural redundancy in Transformer models — both LLMs and vision transformers — and provides practical pipelines for:
- Block Drop — remove full Transformer blocks (attention + MLP together)
- Layer Drop — drop attention or MLP sublayers independently
- Joint Layer Drop — drop across both sublayer types simultaneously
- Vision Transformer Support — DINOv2, DINOv3 ViT, SwinV2, ViT
- Automated Model Selection — grid search over architectures, methods, and drop counts
- Benchmarking — task accuracy, inference speed, FLOPs, and the SDR efficiency metric
The dropping pipeline is built on LLaMA-Factory.
conda create -n llm-drop python=3.10 -y
conda activate llm-drop
git clone https://github.com/zincalex/LLM-Vision-Drop.git
cd LLM-Vision-Drop
# Core dropping pipeline
pip install -r requirements.txt
pip install -e .

Optional: Quantization dependencies (AWQ / GPTQ)
cd src/llmtuner/compression/quantization/AutoAWQ
pip install -e .
cd AutoAWQ_kernels
pip install -e .
cd ../../AutoGPTQ
pip install -vvv --no-build-isolation -e .
# Return to the repository root (AutoGPTQ sits five levels below it)
cd ../../../../..

src/
├── compress.py                      # Entry point for dropping/compression
├── benchmark_speed.py               # LLM inference speed benchmark
├── benchmark_vision_speed.py        # Vision model speed + FLOPs benchmark
├── llmtuner/
│   └── compression/prune/           # Core dropping algorithms (block, layer, joint)
│       └── models/                  # Custom dropped-model classes per architecture
├── vm-eval/
│   └── benchmark_vision.py          # Vision evaluation harness (finetune head + test)
├── model-selection/
│   └── pipeline.py                  # Automated model selection pipeline
├── model-healing/
│   └── heal_model_vm.py             # LoRA-based recovery after layer dropping
└── visualization/
    ├── compute_sdr.py               # SDR metric computation
    └── visualize_benchmark_results.py
scripts/
├── dropping/ # Shell scripts for block/layer drop (LLM + vision)
├── benchmark/ # Evaluation and speed benchmark wrappers
├── model-selection/ # Model selection runner
├── healing/ # Model healing runner
└── visualization/ # SDR and result plotting
Models are downloaded automatically from Hugging Face on first run via from_pretrained. For gated models (e.g. Llama-2), authenticate first:
huggingface-cli login

| Domain | Architecture | Hugging Face ID |
|---|---|---|
| Vision | DINOv2 | facebook/dinov2-giant-imagenet1k-1-layer |
| Vision | DINOv3 ViT | facebook/dinov3-vitl16-pretrain-lvd1689m |
| Vision | SwinV2 | microsoft/swinv2-base-patch4-window16-256 |
| Vision | ViT | google/vit-base-patch16-224 |
| LLM | Mistral-7B | mistralai/Mistral-7B-v0.1 |
| LLM | Llama-2-13B | meta-llama/Llama-2-13b-hf |
After running the dropping pipeline, drop_attn_list and drop_mlp_list are written into the model's config.json. Example configurations:
// Drop attention layers only
{ "drop_attn_list": [25, 26, 24, 22], "drop_mlp_list": [] }
// Drop MLP layers only
{ "drop_attn_list": [], "drop_mlp_list": [26, 27, 25, 24] }
// Drop full blocks
{ "drop_attn_list": [26, 25, 24, 27], "drop_mlp_list": [26, 25, 24, 27] }

Custom model classes are stored under src/llmtuner/compression/prune/models/ and referenced via auto_map in the config.
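As a rough illustration of how the two lists map onto the three dropping modes, the sketch below groups the dropped indices of a config dictionary. The helper name `summarize_drops` is hypothetical and not part of this repository; it assumes only the `drop_attn_list` and `drop_mlp_list` keys shown in the example configurations.

```python
import json


def summarize_drops(config: dict) -> dict:
    """Group dropped layer indices by what was removed at each index.

    Hypothetical helper: only the two config keys come from the README.
    """
    attn = set(config.get("drop_attn_list", []))
    mlp = set(config.get("drop_mlp_list", []))
    return {
        "full_blocks": sorted(attn & mlp),  # attention and MLP both dropped
        "attn_only": sorted(attn - mlp),    # only the attention sublayer dropped
        "mlp_only": sorted(mlp - attn),     # only the MLP sublayer dropped
    }


# The "drop full blocks" example from above, parsed from a JSON string.
example = json.loads(
    '{"drop_attn_list": [26, 25, 24, 27], "drop_mlp_list": [26, 25, 24, 27]}'
)
print(summarize_drops(example))
```

In practice the dictionary would come from the model's config.json; any other keys in that file are ignored by this helper.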
# Vision models
bash scripts/dropping/vision_block_drop.sh
bash scripts/dropping/vision_layer_drop.sh
bash scripts/dropping/vision_layer_drop_joint.sh
# LLMs
bash scripts/dropping/block_drop.sh
bash scripts/dropping/layer_drop.sh
bash scripts/dropping/layer_drop_joint.sh

Each script runs in two phases:
- Similarity estimation — computes cosine similarity between layer inputs and outputs on a calibration set, identifies which layers to drop, and saves the config.
- Post-dropping — applies the dropped config to the model checkpoint.
Similarity results are cached as .pt files under results_prune/cache/ so re-running with the same settings skips recomputation.
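The similarity estimation phase can be sketched in a few lines of plain Python. This is a toy version under stated assumptions, not the repository's implementation: each layer is represented by a single averaged input vector and output vector, and the layers with the highest input/output cosine similarity (those that change their input the least) are the ones selected for dropping. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "not similar"
    return dot / (norm_a * norm_b)


def select_layers_to_drop(layer_io, drop_num):
    """Return the indices of the drop_num most redundant layers.

    layer_io: list of (input_vector, output_vector) pairs, one per layer.
    Assumption: higher input/output similarity means the layer is more redundant.
    """
    scored = [
        (cosine_similarity(inp, out), idx)
        for idx, (inp, out) in enumerate(layer_io)
    ]
    # Most similar first; ties keep their original layer order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [idx for _, idx in scored[:drop_num]]


# Three toy layers: layer 0 rotates its input, layers 1 and 2 barely change it.
toy_layers = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 1.0], [1.0, 1.0]),
]
print(select_layers_to_drop(toy_layers, drop_num=2))
```

The returned indices are ordered by similarity rather than by depth, which matches the unsorted lists in the example configurations.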
Evaluates a dropped vision model on a dataset: optionally fine-tunes the classification head, runs inference on the test split, and saves logits/predictions to HDF5. Also prints per-layer execution verification (confirming which attention/MLP sublayers were actually skipped).
bash scripts/benchmark/benchmark_vm_eval.sh

Key arguments (edit the script or call directly):
CUDA_VISIBLE_DEVICES=0,1 accelerate launch src/vm-eval/benchmark_vision.py \
--model_name_or_path ./dinov2_model \
--dataset LCZ42 \
--dataset_base_dir data \
--prune_method layer_drop_attn \
--drop_num 4 \
--finetune_head \
--epochs 20 \
--lr 0.001 \
--weight_decay 0.03 \
--batch_size 32 \
--batch_size_eval 10 \
--output_file results/dinov2_lcz42_drop4.out

Measures throughput (images/s), latency, memory, and FLOPs for vision models. FLOPs computation respects the dropped layers in the config.
bash scripts/benchmark/benchmark_vm_speed.sh

The Speedup Degradation Ratio (SDR), γ = ΔAccuracy / ΔSpeedup, measures the accuracy cost per unit of throughput gain. A lower γ means more efficient compression.
bash scripts/visualization/compute_sdr_all.sh

Results are written to src/visualization/sdr_results/.
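To make the ratio γ = ΔAccuracy / ΔSpeedup concrete, here is a minimal sketch. It rests on two assumptions that the text does not spell out: ΔAccuracy is the baseline accuracy minus the dropped model's accuracy, and ΔSpeedup is the relative throughput gain (dropped throughput divided by baseline throughput, minus 1). The function name is hypothetical, and compute_sdr.py may define these terms differently.

```python
def speedup_degradation_ratio(base_acc, dropped_acc, base_throughput, dropped_throughput):
    """Accuracy lost per unit of relative throughput gained (assumed definition)."""
    delta_acc = base_acc - dropped_acc
    delta_speedup = dropped_throughput / base_throughput - 1.0
    if delta_speedup <= 0:
        # Without a throughput gain the ratio has no useful meaning here.
        raise ValueError("dropped model is not faster than the baseline")
    return delta_acc / delta_speedup


# Toy numbers: 2 points of accuracy lost for a 25% throughput gain.
gamma = speedup_degradation_ratio(
    base_acc=0.80, dropped_acc=0.78,
    base_throughput=100.0, dropped_throughput=125.0,
)
print(f"SDR = {gamma:.3f}")
```

Under this definition, two variants with the same accuracy loss are ranked by how much throughput they gained, so the faster one gets the lower γ.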
bash scripts/benchmark/benchmark_lm_eval.sh

- This benchmark depends on EleutherAI/lm-evaluation-harness.
- For strict reproduction, the repo uses this fork: s1ghhh/lm-evaluation-harness.
- Use the modeling files in src/llmtuner/model/ when loading Mistral/Llama with dropped configs.
bash scripts/benchmark/benchmark_speed.sh

Edit model_path, save_file, and model_type in the script before running.
Quantization benchmarks (AWQ / GPTQ)
bash scripts/quantization/awq.sh
bash scripts/quantization/gptq.sh

Edit model_path and quant_path in those scripts and ensure CUDA-compatible package versions are installed (see Installation).
The model selection pipeline automates the full search over all combinations of architecture, pruning method, and drop count for a given dataset. It runs in three phases:
- Baseline: fine-tunes the classification head at drop=0 to establish a reference accuracy.
- Search: quick fine-tuning (few epochs) across all (arch × method × drop_n) variants. Early stopping halts a search direction if accuracy drops more than a configurable threshold below baseline.
- Deep fine-tune: full fine-tuning of the winning variant, followed by final test-set evaluation.
Results are logged to results_selection/ with per-variant accuracy, the selected winner, and the final test metrics.
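The search phase with early stopping can be sketched as a nested loop. This is an illustration under assumptions rather than the code in pipeline.py: a "search direction" is taken to mean one (architecture, method) pair explored with increasing drop counts, the baseline is assumed to be per architecture, and `evaluate` is a hypothetical callback standing in for the quick fine-tuning run. How the winner is picked among the scored variants is not specified here, so the sketch only returns the scores.

```python
from typing import Callable, Dict, Iterable, Tuple

Variant = Tuple[str, str, int]  # (architecture, prune_method, drop_count)


def search_variants(
    architectures: Iterable[str],
    prune_methods: Iterable[str],
    drop_counts: Iterable[int],
    baseline_acc: Dict[str, float],
    evaluate: Callable[[str, str, int], float],
    early_stop_threshold: float = 0.05,
) -> Dict[Variant, float]:
    """Score variants, abandoning a direction once accuracy falls too far."""
    methods = list(prune_methods)
    counts = sorted(drop_counts)
    results: Dict[Variant, float] = {}
    for arch in architectures:
        for method in methods:
            for drop_n in counts:
                acc = evaluate(arch, method, drop_n)
                results[(arch, method, drop_n)] = acc
                # Stop increasing drop_n for this (arch, method) pair.
                if baseline_acc[arch] - acc > early_stop_threshold:
                    break
    return results


# Toy evaluator: every dropped layer costs one point of accuracy.
def toy_evaluate(arch: str, method: str, drop_n: int) -> float:
    return 0.90 - 0.01 * drop_n


scores = search_variants(
    ["vit"], ["block_drop"], [4, 8, 12], {"vit": 0.90}, toy_evaluate
)
print(sorted(scores))
```

With the toy evaluator, the drop count of 8 already exceeds the threshold, so the variant with 12 dropped layers is never evaluated.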
bash scripts/model-selection/run_selection.sh

Key parameters (edit run_selection.sh):
dataset="LCZ42" # Target dataset
architectures="dinov2 dinov3_vit swinv2 vit" # Architectures to search
prune_methods="block_drop layer_drop_attn layer_drop_mlp layer_drop_all"
drop_step=4 # Evaluate drop counts: 4, 8, 12, ...
early_stop_threshold=0.05 # Stop if accuracy drops >5% below baseline
baseline_epochs=5
search_epochs=5
deep_epochs=20

All 10 datasets:
| Dataset | Domain | Classes | Notes |
|---|---|---|---|
| imagenet-1k | Natural images | 1000 | Standard benchmark; head fine-tuning skipped if classes match |
| cifar10 | Natural images | 10 | |
| LCZ42 | Remote sensing | 17 | Urban morphology classification |
| CrossD | Cross-domain | varies | Multi-domain classification |
| zoolake | Microscopy | varies | Zooplankton identification |
| lar | Medical | varies | Laryngeal endoscopy |
| InfLarynge | Medical | varies | Inflamed laryngeal tissue |
| DAPlankton | Microscopy | varies | Plankton imaging |
| Bark | Texture | varies | Tree bark classification |
| Pest | Agriculture | varies | Crop pest identification |
All datasets are stored as stratified HDF5 splits (train.h5, val.h5, test.h5) under data/<dataset>/. Preprocessing scripts are in data/.
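Before launching a long run it can help to confirm that all three split files are in place. The sketch below builds the expected paths from the layout described above; the helper names are hypothetical, and it deliberately checks only that the files exist, since the internal structure of the HDF5 files is not described here.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")


def split_paths(base_dir, dataset):
    """Map each split name to its expected file, e.g. data/LCZ42/train.h5."""
    root = Path(base_dir) / dataset
    return {split: root / f"{split}.h5" for split in SPLITS}


def missing_splits(base_dir, dataset):
    """Return the names of splits whose .h5 file is not present on disk."""
    return [
        split
        for split, path in split_paths(base_dir, dataset).items()
        if not path.is_file()
    ]


missing = missing_splits("data", "LCZ42")
if missing:
    print("Missing splits:", ", ".join(missing))
else:
    print("All splits found.")
```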
- Alessandro Viespoli (alessandro.viespoli@studenti.unipd.it)