- [2026/03/15] 🎉 Code and pre-trained models are officially released!
- [2026/02/21] 🏆 DiT-IC is accepted by CVPR 2026!
DiT-IC is a high-performance neural image compression framework that leverages the power of Diffusion Transformers (DiT). By bridging latent diffusion models with standard entropy coding pipelines, DiT-IC achieves state-of-the-art perceptual quality with high efficiency.
- Novel Architecture: The first Diffusion Transformer tailored for high-fidelity image reconstruction.
- Aligned LoRA Adaptation: Efficient fine-tuning via the proposed alignment mechanisms, significantly accelerating the training process.
- High Efficiency: 32x latent space diffusion ensures faster inference and lower memory consumption compared to other models.
- Deploy-Ready: Fully compatible with standard entropy coding, with an easy-to-extend API for your own codecs.
- Checkpoints & Performance
- Installation
- Quick Start: Inference
- Quick Start: Training
- More Features
- BibTeX
We provide scalable model configurations by adjusting the VAE/DiT LoRA ranks.
We provide four operating points (quality=1,2,3,4, corresponding to λ ∈ [2⁴, 2⁻²]), which achieve approximate bitrates of 0.009, 0.044, 0.081, and 0.100 bpp on the Kodak dataset, respectively.
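To put these bitrates in concrete terms, the sketch below converts between bits per pixel and bitstream size using the usual definition bpp = 8 × bytes / (height × width). The function names are illustrative and not part of this repository.

```python
def bits_per_pixel(num_bytes: int, height: int, width: int) -> float:
    """Return the bitrate in bits per pixel for a bitstream of num_bytes."""
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    return 8.0 * num_bytes / (height * width)


def bytes_for_bpp(bpp: float, height: int, width: int) -> float:
    """Return the bitstream size in bytes implied by a bpp value."""
    return bpp * height * width / 8.0


if __name__ == "__main__":
    # Kodak images have 768x512 pixels.
    for bpp in (0.009, 0.044, 0.081, 0.100):
        size = bytes_for_bpp(bpp, 512, 768)
        print(f"{bpp:.3f} bpp -> about {size:.0f} bytes per image")
```

At 0.081 bpp, for example, a Kodak image occupies roughly 4 KB.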
👉 Download Models: HuggingFace - DiT-IC Rank64/128 Checkpoints
| VAE/Diff. LoRA Rank | λ | Kodak BPP ↓ | Kodak LPIPS ↓ | Kodak DISTS ↓ | CLIC BPP ↓ | CLIC LPIPS ↓ | CLIC DISTS ↓ | CLIC FID ↓ | DIV2K BPP ↓ | DIV2K LPIPS ↓ | DIV2K DISTS ↓ | DIV2K FID ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32/64 | 0.5 | 0.080 | 0.094 | 0.059 | 0.079 | 0.067 | 0.038 | 3.06 | 0.089 | 0.084 | 0.043 | 7.22 |
| 64/128 | 0.5 | 0.081 | 0.091 | 0.055 | 0.072 | 0.070 | 0.034 | 2.66 | 0.084 | 0.086 | 0.039 | 6.52 |
Detailed performance logs for the original results (rank 32/64) and extended results (rank 64/128) can be found in the results/ directory.
- Python = 3.12
- PyTorch = 2.8
- CompressAI == 1.2.8 (⚠️ Crucial for consistent bitrate/BPP calculation)
Other environments may also work, but they have not been tested.
pip install -r requirements.txt
The following datasets are used for evaluation:
| Dataset | Description |
|---|---|
| Kodak | 24 natural images (768×512) |
| DIV2K Validation | 100 high-resolution images |
| CLIC 2020 Test | 428 high-resolution images |
You may also evaluate the model on your own datasets.
The released checkpoints merge LoRA weights into the base model, which simplifies deployment and speeds up inference.
CUDA_VISIBLE_DEVICES=0 python -u compress.py \
--config_path="configs/inference_merge.yaml" \
--codec_path="checkpoints/q3_merge_ema.pt" \
--img_path="/data/data/Kodak" \
--rec_path="results/Kodak/rec/" \
--bin_path="results/Kodak/bin/" \
--use_merge \
2>&1 | tee results/logs/eval_ema_Kodak_q3_$(date +%Y%m%d_%H%M%S).log

If you trained the model yourself and the LoRA weights have not been merged yet, run:
CUDA_VISIBLE_DEVICES=0 python -u compress.py \
--config_path="configs/inference.yaml" \
--codec_path="checkpoints/0.5lrd_lora_0050000.pt" \
--img_path="/data/data/Kodak" \
--rec_path="results/Kodak/rec/" \
--bin_path="results/Kodak/bin/" \
2>&1 | tee logs/eval_kodak_0.5lrd_$(date +%Y%m%d_%H%M%S).log

Bitstreams are saved to --bin_path and reconstructed images to --rec_path.
Logs are written to the path passed to tee, e.g. results/logs/.
| Argument | Description |
|---|---|
| --use_ema | Use EMA weights |
| --save_img | Save reconstructed images |
| --entropy_estimation | Estimate bitrate without real entropy coding |
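For readers unfamiliar with EMA (exponential moving average) weights, the sketch below shows the general idea: a smoothed copy of the parameters is updated after each training step and can be used at inference instead of the raw weights. The decay value and the dictionary layout are assumptions for illustration; the training scripts in this repository may implement it differently.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """Blend the current weights into the running average, in place.

    Both arguments map parameter names to floats. The decay of 0.999 is a
    hypothetical default, not a value taken from this repository.
    """
    for name, value in weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * value
    return ema_weights


if __name__ == "__main__":
    ema = {"w": 0.0}
    # Three steps toward a constant weight of 1.0 with a small decay,
    # so that the smoothing effect is visible after only a few updates.
    for _ in range(3):
        ema_update(ema, {"w": 1.0}, decay=0.5)
    print(ema["w"])
```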
To compute image quality metrics using the saved reconstruction folders:
CUDA_VISIBLE_DEVICES=0 python -m eval.evaluate \
--recon_dir "results/Kodak/rec/" \
--gt_dir "/data/data/Kodak" \
2>&1 | tee logs/eval_kodak_$(date +%Y%m%d_%H%M%S).log

Download the following datasets and place them in the corresponding directories:
| Dataset | Size |
|---|---|
| LSDIR | ~50K images |
| MLIC-Train-100K | ~100K images |
Download the required pretrained components:
Place them into SANA/ and ELIC/weights.
A simplified training pipeline is provided below.
Stage 1 (configs/train_256_nogan.yaml):
CUDA_VISIBLE_DEVICES=0,1 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True PYTHONPATH=. torchrun --standalone --nproc_per_node=2 --nnodes=1 train_nogan_ddp.py --config configs/train_256_nogan.yaml

Stage 2 (configs/train_256_gan.yaml):
CUDA_VISIBLE_DEVICES=0,1 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True PYTHONPATH=. torchrun --standalone --nproc_per_node=2 --nnodes=1 train_ddp.py --config configs/train_256_gan.yaml
Set --nproc_per_node to the number of available GPUs. Single-GPU training is also supported.
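If you launch training from a script, the process count can be derived from the visible device list so the two never drift apart. The helper below is a hypothetical convenience, not part of this repository; it only assembles the torchrun flags already shown above.

```python
import os
import shlex


def build_torchrun_command(script, config, visible_devices=None):
    """Return a torchrun argv list with one process per visible GPU.

    visible_devices is a comma-separated id list such as "0,1". When it is
    omitted, CUDA_VISIBLE_DEVICES is read from the environment, falling
    back to a single GPU with id 0.
    """
    if visible_devices is None:
        visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES", "0")
    gpu_ids = [d.strip() for d in visible_devices.split(",") if d.strip()]
    if not gpu_ids:
        raise ValueError("no GPU ids given")
    return [
        "torchrun",
        "--standalone",
        f"--nproc_per_node={len(gpu_ids)}",
        "--nnodes=1",
        script,
        "--config",
        config,
    ]


if __name__ == "__main__":
    cmd = build_torchrun_command(
        "train_nogan_ddp.py", "configs/train_256_nogan.yaml", "0,1"
    )
    print(shlex.join(cmd))
```

The environment variables from the commands above (CUDA_VISIBLE_DEVICES, PYTORCH_CUDA_ALLOC_CONF, PYTHONPATH) still need to be set when the command is executed.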
💡 The rate–distortion trade-off parameter λ can be either kept the same across stages or adjusted during Stage 2.
Examples:
- Stage 1 λ = 2.0 → Stage 2 λ = 2.0
- Stage 1 λ = 0.5 → Stage 2 λ = 2.0
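To see how λ steers the trade-off, here is a toy sketch. It assumes the loss has the form λ × rate + distortion, which is consistent with the operating points listed earlier (larger λ corresponds to lower bitrate) but is not confirmed by this README; the actual objective in the training scripts may contain additional terms. The candidate numbers are made up for illustration.

```python
def rd_loss(bpp, distortion, lam):
    """Rate-distortion objective under the assumed form lam * rate + distortion."""
    return lam * bpp + distortion


# Two hypothetical codec behaviours as (bpp, distortion) pairs.
CANDIDATES = {
    "low_rate": (0.04, 0.15),
    "high_rate": (0.10, 0.09),
}


def preferred(lam):
    """Return the name of the candidate with the smaller loss at this lambda."""
    return min(CANDIDATES, key=lambda name: rd_loss(*CANDIDATES[name], lam))


if __name__ == "__main__":
    for lam in (2.0, 0.5):
        print(f"lambda={lam}: prefers {preferred(lam)}")
```

Under this assumption, raising λ in Stage 2 pushes the model toward lower bitrates than it learned in Stage 1.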
💡 More advanced strategies, such as incorporating image–text embeddings (e.g., CLIP loss), better GAN models, or more refined training pipelines, may further improve performance or accelerate training. Users can modify the configuration files according to their own requirements and hardware setups.
If you start from our pretrained checkpoints:
👉 Download Models: HuggingFace - DiT-IC Rank64/128 Checkpoints
Then finetune at the target bitrate:
CUDA_VISIBLE_DEVICES=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True PYTHONPATH=. torchrun --standalone --nproc_per_node=1 --nnodes=1 train_ddp.py --config configs/train_merge_256_gan.yaml

Training checkpoints contain:
- trained LoRA weights
- trained codec parameters
For faster inference, you may merge LoRA weights into the base model:
CUDA_VISIBLE_DEVICES=0 python merge.py \
--config_path="configs/inference.yaml" \
--codec_path="checkpoints/0.5lrd_lora_0050000.pt"

You can add --use_ema to enable EMA weights.
The merged checkpoint is saved in the same directory as --codec_path.
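Conceptually, merging folds the low-rank update into the base weight so that inference needs a single matrix multiply per layer. The sketch below uses the standard LoRA formulation W' = W + (alpha / rank) · B · A on plain Python lists; it illustrates the arithmetic only, and merge.py may handle scaling, layer selection, and tensor formats differently.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(base, lora_b, lora_a, alpha, rank):
    """Return base + (alpha / rank) * (lora_b @ lora_a) as a new matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


if __name__ == "__main__":
    base = [[1.0, 0.0], [0.0, 1.0]]   # 2x2 base weight
    lora_b = [[1.0], [2.0]]           # 2x1 up-projection (rank 1)
    lora_a = [[3.0, 4.0]]             # 1x2 down-projection
    merged = merge_lora(base, lora_b, lora_a, alpha=2.0, rank=1)
    print(merged)

    # The merged weight gives the same output as the two-branch form.
    x = [[1.0], [1.0]]
    print(matmul(merged, x))
```

After merging, the LoRA matrices are no longer needed at inference time, which is why the merged checkpoints are simpler to deploy.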
You can modify configuration files under configs/ to adapt the framework to different settings:
- dataset scale
- training image resolution
- hardware constraints (e.g., memory, FLOPs)
Planned future updates:
- FP16 / BF16 inference
- Tiled inference
If you find this project useful, please cite:
@inproceedings{shi2026ditic,
title={DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression},
author={Shi, Junqi and Lu, Ming and Li, Xingchen and Ke, Anle and Zhang, Ruiqi and Ma, Zhan},
booktitle={CVPR},
year={2026}
}
Thanks to the following open-source codebases for their wonderful work!
⭐️ If you find this project helpful, please give it a star! ⭐️

