Eric-qi/DiT-IC

⚡ DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression

🔥 News

  • [2026/03/15] 🎉 Code and pre-trained models are officially released!
  • [2026/02/21] 🏆 DiT-IC is accepted by CVPR 2026!

📖 Introduction

DiT-IC is a high-performance neural image compression framework that leverages the power of Diffusion Transformers (DiT). By bridging latent diffusion models with standard entropy coding pipelines, DiT-IC achieves state-of-the-art perceptual quality with high efficiency.

✨ Key Features

  • Novel Architecture: The first Diffusion Transformer tailored for high-fidelity image reconstruction.
  • Aligned LoRA Adaptation: Efficient fine-tuning via the proposed alignment mechanisms, significantly accelerating the training process.
  • High Efficiency: 32× latent-space diffusion enables faster inference and lower memory consumption than other models.
  • Deploy-Ready: Fully compatible with standard entropy coding, with an easy-to-extend API for integrating your own codecs.

📊 Checkpoints and Performance

We provide scalable model configurations by adjusting the VAE/DiT LoRA ranks.

We provide four operating points (quality = 1, 2, 3, 4, with λ ranging from 2⁴ down to 2⁻²), which achieve approximately 0.009, 0.044, 0.081, and 0.100 bpp on the Kodak dataset, respectively.
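To put these bitrates in perspective, bits per pixel (bpp) is the bitstream size in bits divided by the number of pixels. The sketch below converts the approximate Kodak bitrates quoted above into per-image file sizes, assuming the 768×512 Kodak resolution. The helper name `bpp_to_bytes` is illustrative and not part of the repository.

```python
def bpp_to_bytes(bpp: float, width: int, height: int) -> float:
    """Return the bitstream size in bytes implied by a bitrate in bits per pixel."""
    return bpp * width * height / 8


if __name__ == "__main__":
    # Approximate Kodak bitrates for quality = 1..4, as quoted above.
    kodak_bpp = {1: 0.009, 2: 0.044, 3: 0.081, 4: 0.100}
    for quality, bpp in kodak_bpp.items():
        size = bpp_to_bytes(bpp, width=768, height=512)
        print(f"quality={quality}: {bpp:.3f} bpp ~ {size / 1024:.1f} KiB per image")
```

At quality = 3, for example, a Kodak image occupies roughly 4 kB.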

👉 Download Models: HuggingFace - DiT-IC Rank64/128 Checkpoints

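| VAE/Diff. LoRA Rank | λ | Kodak BPP ↓ | Kodak LPIPS ↓ | Kodak DISTS ↓ | CLIC BPP ↓ | CLIC LPIPS ↓ | CLIC DISTS ↓ | CLIC FID ↓ | DIV2K BPP ↓ | DIV2K LPIPS ↓ | DIV2K DISTS ↓ | DIV2K FID ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 32/64 | 0.5 | 0.080 | 0.094 | 0.059 | 0.079 | 0.067 | 0.038 | 3.06 | 0.089 | 0.084 | 0.043 | 7.22 |
| 64/128 | 0.5 | 0.081 | 0.091 | 0.055 | 0.072 | 0.070 | 0.034 | 2.66 | 0.084 | 0.086 | 0.039 | 6.52 |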

Detailed performance logs for the original results (rank 32/64) and extended results (rank 64/128) can be found in the results/ directory.

🔧 Installation

Requirements

  • Python == 3.12
  • PyTorch == 2.8
  • CompressAI == 1.2.8 (⚠️ Crucial for consistent bitrate/BPP calculation)

Other environments may also work, but they have not been tested.
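Because the bitrate calculation depends on the pinned CompressAI version, it can help to verify the environment before running anything. The following is a minimal sketch, not part of the repository: it assumes the installed distribution names are `torch` and `compressai`, and the helper `find_mismatches` is hypothetical.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the requirements list above; "torch" and "compressai" are
# assumed to be the installed distribution names.
REQUIRED = {"torch": "2.8", "compressai": "1.2.8"}


def find_mismatches(required, get_version=version):
    """Return {package: installed_version_or_None} for every pin that is not met."""
    mismatches = {}
    for name, wanted in required.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            mismatches[name] = None  # package is not installed at all
            continue
        # Accept "2.8", "2.8.0" or "2.8.0+cu128"; reject "2.80" or "2.9".
        if installed != wanted and not installed.startswith(wanted + "."):
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    if sys.version_info[:2] != (3, 12):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        print(f"Python: expected 3.12, found {found}")
    for name, installed in find_mismatches(REQUIRED).items():
        print(f"{name}: expected {REQUIRED[name]}, found {installed or 'not installed'}")
```

The script prints nothing when every pin is satisfied.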

Install dependencies

pip install -r requirements.txt

💻 Quick Start: Inference

1️⃣ Prepare Datasets

The following datasets are used for evaluation:

| Dataset | Description |
| --- | --- |
| Kodak | 24 natural images (768×512) |
| DIV2K Validation | 100 high-resolution images |
| CLIC 2020 Test | 428 high-resolution images |

You may also evaluate the model on your own datasets.

2️⃣ Run Compression

Option A: Inference with Merged Weights (Recommended)

The released checkpoints merge LoRA weights into the base model, which simplifies deployment and speeds up inference.

CUDA_VISIBLE_DEVICES=0 python -u compress.py \
        --config_path="configs/inference_merge.yaml" \
        --codec_path="checkpoints/q3_merge_ema.pt" \
        --img_path="/data/data/Kodak" \
        --rec_path="results/Kodak/rec/" \
        --bin_path="results/Kodak/bin/" \
        --use_merge \
        2>&1 | tee results/logs/eval_ema_Kodak_q3_$(date +%Y%m%d_%H%M%S).log

Option B: Inference with Raw LoRA Checkpoints

If you trained the model yourself and the LoRA weights have not yet been merged, run:

CUDA_VISIBLE_DEVICES=0 python -u compress.py \
        --config_path="configs/inference.yaml" \
        --codec_path="checkpoints/0.5lrd_lora_0050000.pt" \
        --img_path="/data/data/Kodak" \
        --rec_path="results/Kodak/rec/" \
        --bin_path="results/Kodak/bin/" \
        2>&1 | tee logs/eval_kodak_0.5lrd_$(date +%Y%m%d_%H%M%S).log

Bitstreams will be saved in --bin_path, and reconstructed images in --rec_path.

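Logs are written to the path passed to tee (results/logs/ for Option A, logs/ for Option B). Create that directory beforehand, since tee does not create missing directories.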

| Argument | Description |
| --- | --- |
| --use_ema | Use EMA weights |
| --save_img | Save reconstructed images |
| --entropy_estimation | Estimate bitrate without real entropy coding |
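When evaluating several datasets, the command above can be assembled programmatically. The sketch below only uses the flags documented in this section; the function `build_compress_cmd` and the non-Kodak dataset path are hypothetical placeholders, and optional flags are passed through verbatim on the assumption that they are plain switches.

```python
import shlex


def build_compress_cmd(config_path, codec_path, img_path, rec_path, bin_path,
                       use_merge=False, extra_flags=()):
    """Assemble the compress.py argument list from the flags documented above."""
    cmd = [
        "python", "-u", "compress.py",
        f"--config_path={config_path}",
        f"--codec_path={codec_path}",
        f"--img_path={img_path}",
        f"--rec_path={rec_path}",
        f"--bin_path={bin_path}",
    ]
    if use_merge:
        cmd.append("--use_merge")
    cmd.extend(extra_flags)  # e.g. ["--save_img"], passed through unchanged
    return cmd


if __name__ == "__main__":
    # The CLIC path is a placeholder; point it at your own copy of the dataset.
    datasets = {"Kodak": "/data/data/Kodak", "CLIC": "/path/to/CLIC"}
    for name, root in datasets.items():
        cmd = build_compress_cmd(
            config_path="configs/inference_merge.yaml",
            codec_path="checkpoints/q3_merge_ema.pt",
            img_path=root,
            rec_path=f"results/{name}/rec/",
            bin_path=f"results/{name}/bin/",
            use_merge=True,
        )
        # Print the command; pass cmd to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```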

3️⃣ Evaluation (Optional)

To compute image quality metrics using the saved reconstruction folders:

CUDA_VISIBLE_DEVICES=0 python -m eval.evaluate \
    --recon_dir "results/Kodak/rec/" \
    --gt_dir "/data/data/Kodak" \
    2>&1 | tee logs/eval_kodak_$(date +%Y%m%d_%H%M%S).log

🚗 Quick Start: Training

1️⃣ Prepare Training Datasets

Download the following datasets and place them in the corresponding directories:

| Dataset | Size |
| --- | --- |
| LSDIR | ~50K images |
| MLIC-Train-100K | ~100K images |

2️⃣ Train from Scratch

Download the required pretrained components:

Place them into SANA/ and ELIC/weights.

A simplified training pipeline is provided below.

Stage 1 — Train w/o GAN loss

CUDA_VISIBLE_DEVICES=0,1 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True PYTHONPATH=. torchrun --standalone --nproc_per_node=2 --nnodes=1 train_nogan_ddp.py --config configs/train_256_nogan.yaml

Stage 2 — Finetune w/ GAN loss

CUDA_VISIBLE_DEVICES=0,1 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True PYTHONPATH=. torchrun --standalone --nproc_per_node=2 --nnodes=1 train_ddp.py --config configs/train_256_gan.yaml

Set --nproc_per_node to the number of GPUs listed in CUDA_VISIBLE_DEVICES. Single-GPU training is also supported.
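To keep the process count and the visible devices in sync, the count can be derived from the environment instead of being typed twice. This is a small sketch under stated assumptions: the helpers `count_visible_gpus` and `torchrun_cmd` are hypothetical, and the torchrun arguments simply mirror the commands above.

```python
import os
import shlex


def count_visible_gpus(env=None):
    """Count the device entries in CUDA_VISIBLE_DEVICES (0 if unset or empty)."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES", "")
    return len([device for device in value.split(",") if device.strip()])


def torchrun_cmd(script, config, env=None):
    """Build the torchrun argument list with one process per visible GPU."""
    n_gpus = count_visible_gpus(env)
    if n_gpus == 0:
        # When the variable is unset, all GPUs are visible and the count has to
        # come from elsewhere (for example torch.cuda.device_count()).
        raise ValueError("CUDA_VISIBLE_DEVICES is unset or empty")
    return [
        "torchrun", "--standalone",
        f"--nproc_per_node={n_gpus}", "--nnodes=1",
        script, "--config", config,
    ]


if __name__ == "__main__":
    try:
        print(shlex.join(torchrun_cmd("train_nogan_ddp.py", "configs/train_256_nogan.yaml")))
    except ValueError as err:
        print(f"Cannot derive the process count: {err}")
```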

💡 The rate–distortion trade-off parameter λ can be either kept the same across stages or adjusted during Stage 2.

Examples:

  • Stage 1 λ = 2.0 → Stage 2 λ = 2.0

  • Stage 1 λ = 0.5 → Stage 2 λ = 2.0

💡 More advanced strategies, such as incorporating image–text embeddings (e.g., CLIP loss), better GAN models, or more refined training pipelines, may further improve performance or accelerate training. Users can modify the configuration files according to their own requirements and hardware setups.

3️⃣ Finetune Pretrained Model

If you start from our pretrained checkpoints:

👉 Download Models: HuggingFace - DiT-IC Rank64/128 Checkpoints

Then finetune at the target bitrate:

CUDA_VISIBLE_DEVICES=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True PYTHONPATH=. torchrun --standalone --nproc_per_node=1 --nnodes=1 train_ddp.py --config configs/train_merge_256_gan.yaml

4️⃣ Merge LoRA Weights

Training checkpoints contain:

  • trained LoRA weights

  • trained codec parameters

For faster inference, you may merge LoRA weights into the base model:

CUDA_VISIBLE_DEVICES=0 python merge.py \
        --config_path="configs/inference.yaml" \
        --codec_path="checkpoints/0.5lrd_lora_0050000.pt"

You can add --use_ema to enable EMA weights.

The merged checkpoint will be saved in the same directory as --codec_path.
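For intuition on why merging speeds up inference: a LoRA layer computes the base output plus a low-rank branch, and the two can be folded into a single weight matrix ahead of time. The sketch below illustrates this generic LoRA identity with plain Python lists. It is not a description of what merge.py does internally, and the names `matmul` and `merge_lora` and the `scale` parameter are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a), the standard LoRA merge."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


if __name__ == "__main__":
    weight = [[1.0, 2.0], [3.0, 4.0]]  # base layer, shape (out=2, in=2)
    lora_a = [[1.0, -1.0]]             # down-projection, shape (rank=1, in=2)
    lora_b = [[0.5], [2.0]]            # up-projection, shape (out=2, rank=1)
    x = [[1.0], [2.0]]                 # input column vector

    # Unmerged path: base output plus the low-rank branch (two extra matmuls).
    base = matmul(weight, x)
    branch = matmul(lora_b, matmul(lora_a, x))
    unmerged = [[b[0] + r[0]] for b, r in zip(base, branch)]

    # Merged path: a single matmul with the folded weight.
    merged = matmul(merge_lora(weight, lora_a, lora_b), x)
    print("outputs match:", unmerged == merged)
```

After merging, the low-rank matrices are no longer needed at inference time, which removes the extra matrix multiplications per adapted layer.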

🧩 More Features

You can modify configuration files under configs/ to adapt the framework to different settings:

  • dataset scale

  • training image resolution

  • hardware constraints (e.g., memory, FLOPs)

Planned future updates:

  • FP16 / BF16 inference

  • Tiled inference

📖 BibTeX

If you find this project useful, please cite:

@inproceedings{shi2026ditic,
  title={DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression},
  author={Shi, Junqi and Lu, Ming and Li, Xingchen and Ke, Anle and Zhang, Ruiqi and Ma, Zhan},
  booktitle={CVPR},
  year={2026}
}

🥰 Acknowledgement

Thanks to the following open-source codebases for their wonderful work!


⭐️ If you find this project helpful, please give it a star! ⭐️
