- [2026/06/07] 🚀 Released SEATS code for Qwen2.5-Omni-7B, with LMMs-Eval adaptation and baselines.
- [2026/05/19] 📄 Paper released on arXiv and project page is online.
SEATS is a training-free, stage-adaptive token selection method for efficient omni-modal LLM inference. By analyzing layer-wise token dependency, it reveals that visual and audio dependencies follow a block-wise pattern and weaken with depth. SEATS removes spatiotemporal redundancy before the LLM, progressively prunes tokens inside the LLM, and fully removes non-textual tokens in late layers.
- 💡 New Insight: Reveals a block-wise dependence pattern in omni-modal LLMs, where reliance on visual and audio tokens weakens with layer depth.
- ⚡ Strong Efficiency: 9.3x FLOPs reduction and 4.8x prefill speedup at 10% token retention while preserving 96.3% performance.
- 🎯 Stage-adaptive Design: Diversity-based pre-LLM selection + query-guided inner-LLM progressive pruning + late-layer full removal.
- 🔌 Broad Compatibility: Plug-and-play and training-free for direct application to Qwen2.5-Omni-7B and Qwen3-Omni-30B.
- Support Qwen2.5-Omni-7B
- Release benchmark adaptation code for LMMs-Eval (WorldSense, Daily-Omni, OmniVideoBench, Video-MME, LVOmniBench)
- Evaluation scripts and reproduction guide (adapted for LMMs-Eval)
- Release more baseline implementations (FastV, VisionZip, Random)
- Support Qwen3-Omni-30B
- Release more baseline implementations (DivPrune, DyCoke, and OmniZip)
- Future work: support more models (OmniVinci-7B)
SEATS is a three-stage method:
- Pre-LLM Token Selection: Removes spatiotemporal redundancy within each temporal window via attention-weighted diversity selection.
- Inner-LLM Token Selection: Progressively prunes tokens with a block-wise token retention ratio decay schedule and top-down budget allocation (inter-window then intra-window) guided by query relevance.
- Late-block Removal: Removes all remaining non-textual tokens in late layers where cross-modal fusion is complete.
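The three stages can be illustrated on toy vectors. The following is a minimal sketch under stated assumptions, not the code in `seats/`: the function names, the greedy max-min rule, the geometric decay, and cosine similarity as a stand-in for query relevance are all hypothetical choices made for illustration. The top-down (inter-window then intra-window) budget allocation is omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_diverse(tokens, attn, keep):
    """Stage I (assumed form): greedy selection inside one temporal window.

    Start from the highest-attention token, then repeatedly add the token
    with the largest attention-weighted dissimilarity to the selected set.
    """
    if keep <= 0 or not tokens:
        return []
    chosen = [max(range(len(tokens)), key=lambda i: attn[i])]
    while len(chosen) < min(keep, len(tokens)):
        def score(i):
            nearest = max(cosine(tokens[i], tokens[j]) for j in chosen)
            return attn[i] * (1.0 - nearest)
        rest = [i for i in range(len(tokens)) if i not in chosen]
        chosen.append(max(rest, key=score))
    return sorted(chosen)

def decay_schedule(first_ratio, num_blocks, decay=0.5):
    """Stage II (assumed form): geometric retention ratio, one value per block."""
    return [first_ratio * decay ** b for b in range(num_blocks)]

def progressive_prune(tokens, query, ratios):
    """Stage II (assumed form): at each block keep the tokens most similar
    to the query; ratios are relative to the initial token count."""
    n0 = len(tokens)
    alive = list(range(n0))
    history = []
    for r in ratios:
        keep = max(1, math.ceil(r * n0))
        ranked = sorted(alive, key=lambda i: cosine(tokens[i], query), reverse=True)
        alive = sorted(ranked[:keep])  # restore original token order
        history.append(list(alive))
    return history

def drop_non_text(sequence):
    """Stage III: keep only text tokens from (modality, vector) pairs."""
    return [(m, v) for m, v in sequence if m == "text"]

# Toy window: tokens 0 and 1 are duplicates, so only one of them survives Stage I.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = [0.9, 0.8, 0.5, 0.4]
print(select_diverse(tokens, attn, keep=2))
print(progressive_prune(tokens, [0.0, 1.0], decay_schedule(0.8, 3)))
```

In this toy run the redundant copy of the first token is skipped in Stage I, and Stage II shrinks the surviving set block by block until only the token closest to the query remains.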
We use Anaconda to set up a deep learning workspace that supports PyTorch. Run the following script to install all the required packages.
# git clone this repository
git clone https://github.com/xxayt/SEATS.git
cd SEATS
# create a new anaconda env
conda create -n SEATS_env python=3.10 -y
conda activate SEATS_env
# install dependencies
bash scripts/base/setup.sh
# install the bundled lmms-eval in editable mode
cd lmms-eval
pip install -e .
cd ..
# (Recommended) install torch and flash-attn
# pip install torch==2.8.0 torchvision==0.23.0
pip install flash-attn --no-build-isolation

We adapt five omni-modal benchmarks into LMMs-Eval, so you can run them directly through this repository. Please first download the corresponding annotation data and videos from the links below.
| Benchmark | Data | Videos | Task name |
|---|---|---|---|
| Daily-Omni | xxayt/Daily-Omni | liarliar/Daily-Omni | dailyomni |
| WorldSense | lmms-lab/WorldSense | lmms-lab/WorldSense | worldsense |
| OmniVideoBench | xxayt/OmniVideoBench | NJU-LINK/OmniVideoBench | omnivideobench |
| Video-MME | lmms-lab/Video-MME | lmms-lab/Video-MME | videomme |
| LVOmniBench | xxayt/LVOmniBench | KD-TAO/LVOmniBench | lvomnibench |
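One possible way to fetch these repositories is sketched below. It assumes every entry in the table is a Hugging Face Hub dataset repository and that a per-task folder under `data/` is acceptable. The directory layout and the helper names `build_plan` and `download_all` are hypothetical, so check the task configs in `lmms-eval/` for the paths the benchmarks actually expect.

```python
from pathlib import Path

# (task name, annotation repo, video repo), copied from the table above.
BENCHMARKS = [
    ("dailyomni", "xxayt/Daily-Omni", "liarliar/Daily-Omni"),
    ("worldsense", "lmms-lab/WorldSense", "lmms-lab/WorldSense"),
    ("omnivideobench", "xxayt/OmniVideoBench", "NJU-LINK/OmniVideoBench"),
    ("videomme", "lmms-lab/Video-MME", "lmms-lab/Video-MME"),
    ("lvomnibench", "xxayt/LVOmniBench", "KD-TAO/LVOmniBench"),
]

def build_plan(root="data"):
    """Return (repo_id, target_dir) pairs, downloading each repo once per task."""
    plan = []
    for task, data_repo, video_repo in BENCHMARKS:
        # dict.fromkeys de-duplicates while keeping order (data and videos
        # live in the same repo for some benchmarks).
        for repo in dict.fromkeys((data_repo, video_repo)):
            plan.append((repo, Path(root) / task / repo.replace("/", "__")))
    return plan

def download_all(root="data"):
    """Download every planned repo; requires `pip install huggingface_hub`."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose
    for repo, target in build_plan(root):
        # Assumption: all repos are of type "dataset" on the Hub.
        snapshot_download(repo_id=repo, repo_type="dataset", local_dir=str(target))

if __name__ == "__main__":
    for repo, target in build_plan():
        print(f"{repo} -> {target}")
    # download_all()  # uncomment to start the (large) downloads
```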
Once the data is ready, launch evaluation with the scripts under scripts/. Results are written to output/. We implement qwen2_5_omni_zip as a unified LMMs-Eval model wrapper that dispatches to SEATS and all baselines for omni-modal LLM token compression.
To evaluate the model with full tokens (no compression), run:
bash scripts/eval_qwen2_5_omni_full_tokens.sh
To evaluate our SEATS method on the five benchmarks, use the following command:
bash scripts/eval_qwen2_5_omni_seats.sh
You can customize the compression settings by editing:
- `scripts/eval_qwen2_5_omni_seats.sh`: `tasks_list` (which benchmarks to run) and `ratio_pairs` (per-modality token retention budgets, swept over multiple settings).
- `seats/config.yaml`: SEATS method hyperparameters (e.g., progressive drop layers, late-block layer, window size).
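To get a feel for what a sweep over per-modality budgets means in token counts, here is a small sketch. The pair format `(video_ratio, audio_ratio)`, the rounding rule, the helper name `budget`, and the example token counts are all assumptions for illustration; the actual `ratio_pairs` syntax is defined in the script itself.

```python
def budget(num_video, num_audio, video_ratio, audio_ratio):
    """Tokens kept per modality and the resulting overall retention."""
    kept_video = round(num_video * video_ratio)
    kept_audio = round(num_audio * audio_ratio)
    overall = (kept_video + kept_audio) / (num_video + num_audio)
    return {"video": kept_video, "audio": kept_audio, "overall": overall}

# Hypothetical sweep: (video_ratio, audio_ratio) pairs for an input with
# 2000 video tokens and 500 audio tokens.
ratio_pairs = [(0.1, 0.1), (0.2, 0.3), (0.5, 0.5)]
for video_ratio, audio_ratio in ratio_pairs:
    print((video_ratio, audio_ratio), budget(2000, 500, video_ratio, audio_ratio))
```

Because video tokens usually dominate the sequence, the overall retention stays close to the video ratio even when the audio ratio is set higher.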
We also provide the following scripts to evaluate the baseline methods adapted for omni-modal LLMs:
bash scripts/eval_qwen2_5_omni_random.sh # Random
bash scripts/eval_qwen2_5_omni_fastv.sh # FastV
bash scripts/eval_qwen2_5_omni_fastv_omni.sh # FastV-om
bash scripts/eval_qwen2_5_omni_visionzip.sh # VisionZip
bash scripts/eval_qwen2_5_omni_visionzip_omni.sh # VisionZip-om
... # more to be added

SEATS/
├── scripts/ # Shell entry points (one per method) + shared base
│ ├── base/
│ │ ├── setup.sh # Python dependency installation
│ │ └── eval_qwen2_5_omni_zip.sh # Shared accelerate + lmms-eval launcher
│ ├── eval_qwen2_5_omni_seats.sh # SEATS (our method)
│ └── ...
├── seats/ # SEATS three-stage implementation
│ ├── pre_llm_units.py # Stage I: winDivPrune
│ ├── inner_llm_units.py # Stage II: inner-LLM stage-adaptive selection
│ ├── ratio_decay_scheduler.py # block-wise TRR decay schedule
│ ├── modeling_qwen2_5_omni_seats.py # patched Thinker / TextModel forwards
│ └── config.yaml # SEATS hyperparameters
├── baselines/ # Per-method patches; one subfolder per baseline
│ ├── utils.py # apply_zip_method_patch() dispatcher
│ ├── full_tokens/ # No compression (config only)
│ ├── visionzip_omni/ # VisionZip adapted for omni-modal
│ └── ...
├── models/qwen2_5_omni/ # Vendored Qwen2.5-Omni model code
└── lmms-eval/ # Vendored LMMs-Eval (registers `qwen2_5_omni_zip`)
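The tree describes a single wrapper that dispatches to one patch per method. A common way to structure such a dispatcher is a name-to-function registry, sketched below. This is a generic pattern, not the contents of `baselines/utils.py`: `register`, `dispatch_patch`, and the dict-based "patches" are hypothetical stand-ins, whereas the real patches modify model forward passes.

```python
from typing import Callable, Dict

# Registry mapping a method name to the function that applies its patch.
_PATCHES: Dict[str, Callable[[dict], dict]] = {}

def register(name: str):
    """Decorator that records a patch function under `name`."""
    def wrap(fn):
        _PATCHES[name] = fn
        return fn
    return wrap

@register("full_tokens")
def _full_tokens(model_cfg: dict) -> dict:
    # No compression: return the configuration with compression disabled.
    return dict(model_cfg, compression=None)

@register("seats")
def _seats(model_cfg: dict) -> dict:
    # Stand-in for swapping in the patched forward functions.
    return dict(model_cfg, compression="seats")

def dispatch_patch(method: str, model_cfg: dict) -> dict:
    """Look up `method` and apply its patch, failing loudly on unknown names."""
    try:
        patch = _PATCHES[method]
    except KeyError:
        raise ValueError(
            f"unknown method {method!r}; choose from {sorted(_PATCHES)}"
        ) from None
    return patch(model_cfg)

print(dispatch_patch("seats", {"name": "qwen2_5_omni"}))
```

With this layout, adding a baseline means adding one registered function, and the evaluation entry point never needs to change.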
This implementation relies on resources from Qwen2.5-Omni, Qwen3-Omni, LMMs-Eval, OmniZip, VisionZip, and DivPrune. We thank the original authors for their excellent contributions and for making their work publicly available.
If you find this work useful, please consider citing:
@article{xin2026seats,
title={Stage-adaptive Token Selection for Efficient Omni-modal LLMs},
author={Xin, Zijie and Yang, Jie and Zhao, Ruixiang and Wang, Tianyi and Rao, Fengyun and Lyu, Jing and Li, Xirong},
journal={arXiv preprint arXiv:2605.20035},
year={2026}
}

This project is licensed under the MIT License. For commercial licensing or any use beyond research, please contact the authors.
For any questions about this project (e.g., corrupted files or loading errors), please reach out at: xinzijie@ruc.edu.cn

