SEATS

Stage-adaptive Token Selection for Efficient Omni-modal LLMs

Zijie Xin¹, Jie Yang²📧, Ruixiang Zhao¹, Tianyi Wang², Fengyun Rao², Jing Lyu², Xirong Li¹📧
📧 Corresponding authors
¹ Renmin University of China, ² WeChat Vision, Tencent Inc.

📢 News

  • [2026/06/07] 🚀 Released SEATS code for Qwen2.5-Omni-7B, with LMMs-Eval adaptation and baselines.
  • [2026/05/19] 📄 Paper released on arXiv and project page is online.

👀 Overview

SEATS is a training-free, stage-adaptive token selection method for efficient omni-modal LLM inference. By analyzing layer-wise token dependency, it reveals that visual and audio dependencies follow a block-wise pattern and weaken with depth. SEATS removes spatiotemporal redundancy before the LLM, progressively prunes tokens inside the LLM, and fully removes non-textual tokens in late layers.
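As a rough illustration of how the three phases could translate into a per-layer retention ratio for non-textual tokens, the sketch below builds a toy schedule. The function name, the multiplicative decay factor, and the layer indices are assumptions made for illustration; the repository's own schedule lives in seats/ratio_decay_scheduler.py and may differ.

```python
def retention_schedule(num_layers, start_ratio, drop_layers, decay, late_layer):
    """Return the fraction of non-textual tokens entering each LLM layer.

    start_ratio: ratio left after pre-LLM selection (hypothetical parameter).
    drop_layers: layer indices where a further pruning step happens.
    decay:       multiplicative factor applied at each pruning step.
    late_layer:  first layer at which all non-textual tokens are removed.
    """
    if not 0.0 < start_ratio <= 1.0:
        raise ValueError("start_ratio must be in (0, 1]")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")

    drop_layers = set(drop_layers)
    ratios = []
    current = start_ratio
    for layer in range(num_layers):
        if layer >= late_layer:
            current = 0.0          # late layers: only text tokens remain
        elif layer in drop_layers:
            current *= decay       # block boundary: prune further
        ratios.append(current)
    return ratios


if __name__ == "__main__":
    # Toy 8-layer model, not the real Qwen2.5-Omni depth.
    print(retention_schedule(8, 0.5, [2, 4], 0.5, 6))
```

The ratio stays constant inside a block, shrinks at each block boundary, and drops to zero from the late layer onward.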

✨ Key Highlights

  • 💡 New Insight: Reveals a block-wise dependence pattern in omni-modal LLMs, where reliance on visual and audio tokens weakens with layer depth.
  • Strong Efficiency: 9.3x FLOPs reduction and 4.8x prefill speedup at 10% token retention while preserving 96.3% performance.
  • 🎯 Stage-adaptive Design: Diversity-based pre-LLM selection + query-guided inner-LLM progressive pruning + late-layer full removal.
  • 🔌 Broad Compatibility: Plug-and-play and training-free for direct application to Qwen2.5-Omni-7B and Qwen3-Omni-30B.
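To make the diversity-based pre-LLM selection in the highlights concrete, here is a minimal pure-Python sketch of attention-weighted greedy max-min selection inside one temporal window. The scoring rule (attention score times cosine distance to the already selected set) and all function names are assumptions for illustration; the actual criterion in seats/pre_llm_units.py may differ, and a real implementation would operate on tensors.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def select_diverse_tokens(features, attn_scores, keep):
    """Pick `keep` token indices from one window (hypothetical helper).

    features:    list of token feature vectors in the window.
    attn_scores: one non-negative importance score per token.
    """
    n = len(features)
    keep = min(keep, n)
    if keep <= 0:
        return []

    # Seed with the most attended token.
    selected = [max(range(n), key=lambda i: attn_scores[i])]
    # Distance from every token to its nearest selected token.
    min_dist = [cosine_distance(features[i], features[selected[0]]) for i in range(n)]

    while len(selected) < keep:
        remaining = [i for i in range(n) if i not in selected]
        # Prefer tokens that are both attended and far from what is kept.
        best = max(remaining, key=lambda i: attn_scores[i] * min_dist[i])
        selected.append(best)
        for i in range(n):
            min_dist[i] = min(min_dist[i], cosine_distance(features[i], features[best]))

    return sorted(selected)  # keep the original token order


if __name__ == "__main__":
    feats = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.9, 0.1]]
    attn = [0.4, 0.3, 0.2, 0.1]
    print(select_diverse_tokens(feats, attn, keep=2))
```

In the toy input, token 1 is nearly a duplicate of token 0, so the second slot goes to token 2 even though its attention score is lower.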

📅 TODO

  • Support Qwen2.5-Omni-7B
  • Release benchmark adaptation code for LMMs-Eval (WorldSense, Daily-Omni, OmniVideoBench, Video-MME, LVOmniBench)
  • Evaluation scripts and reproduction guide (adapted for LMMs-Eval)
  • Release more baseline implementations (FastV, VisionZip, Random)
  • Support Qwen3-Omni-30B
  • Release more baseline implementations (DivPrune, DyCoke, and OmniZip)
  • Future work: support more models (OmniVinci-7B)

🏗️ Method

SEATS is a three-stage method:

  1. Pre-LLM Token Selection: Removes spatiotemporal redundancy within each temporal window via attention-weighted diversity selection.
  2. Inner-LLM Token Selection: Progressively prunes tokens with a block-wise token retention ratio decay schedule and top-down budget allocation (inter-window then intra-window) guided by query relevance.
  3. Late-block Removal: Removes all remaining non-textual tokens in late layers where cross-modal fusion is complete.
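The top-down budget allocation in stage 2 can be sketched as follows: a total token budget is first split across temporal windows according to their query relevance, then each window keeps its highest-scoring tokens. The proportional split, the tie-breaking rule, and the function name are assumptions for illustration and are not taken from seats/inner_llm_units.py.

```python
def allocate_budget(window_scores, total_budget):
    """Return the kept token indices for each window (hypothetical helper).

    window_scores: list of windows, each a list of non-negative
                   query-relevance scores (one per token).
    total_budget:  number of tokens to keep across all windows.
    """
    sizes = [len(w) for w in window_scores]
    total_budget = min(total_budget, sum(sizes))

    # Inter-window step: share the budget in proportion to summed relevance.
    mass = [sum(w) for w in window_scores]
    total_mass = sum(mass) or 1.0
    ideal = [total_budget * m / total_mass for m in mass]
    budgets = [min(int(x), s) for x, s in zip(ideal, sizes)]

    # Hand out leftover slots one at a time to the window that is
    # furthest below its ideal share and still has tokens available.
    leftover = total_budget - sum(budgets)
    while leftover > 0:
        open_windows = [i for i in range(len(sizes)) if budgets[i] < sizes[i]]
        target = max(open_windows, key=lambda i: ideal[i] - budgets[i])
        budgets[target] += 1
        leftover -= 1

    # Intra-window step: keep the top-k tokens, in original order.
    kept = []
    for scores, k in zip(window_scores, budgets):
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        kept.append(sorted(ranked[:k]))
    return kept


if __name__ == "__main__":
    scores = [
        [0.9, 0.1, 0.8, 0.2],      # window strongly related to the query
        [0.1, 0.1, 0.1, 0.1],      # mostly irrelevant window
        [0.5, 0.6, 0.05, 0.05],
    ]
    print(allocate_budget(scores, total_budget=5))
```

With these toy scores the irrelevant middle window receives no budget, while the first window keeps three of its four tokens.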

🔧 Dependencies and Installation

We use Anaconda to set up a deep learning workspace that supports PyTorch. Run the following commands to install all required packages.

# git clone this repository
git clone https://github.com/xxayt/SEATS.git
cd SEATS

# create a new anaconda env
conda create -n SEATS_env python=3.10 -y
conda activate SEATS_env

# install dependencies
bash scripts/base/setup.sh

# install the bundled lmms-eval in editable mode
cd lmms-eval
pip install -e .
cd ..

# (Recommended) install torch and flash-attn
# pip install torch==2.8.0 torchvision==0.23.0
pip install flash-attn --no-build-isolation
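After installation, a quick sanity check along these lines can confirm that the key packages are importable in the active environment. This helper is not part of the repository; the module names `torch`, `flash_attn`, and `lmms_eval` are the usual import names for these packages and are assumed here.

```python
import importlib.util


def check_packages(names=("torch", "flash_attn", "lmms_eval")):
    """Map each top-level module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages().items():
        status = "ok" if found else "MISSING"
        print(f"{name:12s} {status}")
```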

🚀 Evaluation

We adapt five omni-modal benchmarks to LMMs-Eval so that you can run them directly from this repository. First download the corresponding annotation data and videos from the sources listed below.

| Benchmark | Data | Videos | Task name |
|---|---|---|---|
| Daily-Omni | xxayt/Daily-Omni | liarliar/Daily-Omni | dailyomni |
| WorldSense | lmms-lab/WorldSense | lmms-lab/WorldSense | worldsense |
| OmniVideoBench | xxayt/OmniVideoBench | NJU-LINK/OmniVideoBench | omnivideobench |
| Video-MME | lmms-lab/Video-MME | lmms-lab/Video-MME | videomme |
| LVOmniBench | xxayt/LVOmniBench | KD-TAO/LVOmniBench | lvomnibench |
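If the Data and Videos entries above are Hugging Face dataset repositories (an assumption based on their `owner/name` form), the download could be scripted roughly as below with `huggingface_hub.snapshot_download`. The `data/` target directory and the helper names are hypothetical; check the evaluation scripts for the paths they actually expect.

```python
# Task name -> (annotation repo, video repo), copied from the table above.
BENCHMARK_SOURCES = {
    "dailyomni": ("xxayt/Daily-Omni", "liarliar/Daily-Omni"),
    "worldsense": ("lmms-lab/WorldSense", "lmms-lab/WorldSense"),
    "omnivideobench": ("xxayt/OmniVideoBench", "NJU-LINK/OmniVideoBench"),
    "videomme": ("lmms-lab/Video-MME", "lmms-lab/Video-MME"),
    "lvomnibench": ("xxayt/LVOmniBench", "KD-TAO/LVOmniBench"),
}


def plan_downloads(task_names, root="data"):
    """Return (repo_id, local_dir) pairs without touching the network."""
    plan = []
    for task in task_names:
        # dict.fromkeys drops the duplicate when data and videos share a repo.
        for repo_id in dict.fromkeys(BENCHMARK_SOURCES[task]):
            local_dir = f"{root}/{repo_id.replace('/', '__')}"
            plan.append((repo_id, local_dir))
    return plan


def download(plan):
    """Fetch every repo in the plan (requires `pip install huggingface_hub`)."""
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in plan:
        snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir)


if __name__ == "__main__":
    for repo_id, local_dir in plan_downloads(["dailyomni", "worldsense"]):
        print(f"{repo_id} -> {local_dir}")
    # download(plan_downloads(["dailyomni"]))  # uncomment to actually fetch
```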

Once the data is ready, launch evaluation with the scripts under scripts/. Results are written to output/. We implement qwen2_5_omni_zip as a unified LMMs-Eval model wrapper that dispatches to SEATS and all baselines for omni-modal LLM token compression.

Full tokens

bash scripts/eval_qwen2_5_omni_full_tokens.sh

SEATS (our method)

To evaluate our SEATS method on the five benchmarks, use the following command:

bash scripts/eval_qwen2_5_omni_seats.sh

You can customize the compression settings by editing:

  • scripts/eval_qwen2_5_omni_seats.sh: tasks_list (which benchmarks to run) and ratio_pairs (per-modality token retention budgets, swept over multiple settings).
  • seats/config.yaml: SEATS method hyperparameters (e.g., progressive drop layers, late-block layer, window size).
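To get a feel for what a sweep over per-modality retention budgets implies, the sketch below converts a list of (visual, audio) ratio pairs into kept-token counts and an overall retention ratio. The tuple order, the example token counts, and the function name are assumptions; the exact format of ratio_pairs in the shell script is not shown here.

```python
def token_budgets(num_visual, num_audio, ratio_pairs):
    """Return one summary dict per (visual_ratio, audio_ratio) pair."""
    rows = []
    for visual_ratio, audio_ratio in ratio_pairs:
        if not (0.0 < visual_ratio <= 1.0 and 0.0 < audio_ratio <= 1.0):
            raise ValueError("retention ratios must be in (0, 1]")
        # Keep at least one token per modality.
        kept_visual = max(1, round(num_visual * visual_ratio))
        kept_audio = max(1, round(num_audio * audio_ratio))
        rows.append({
            "visual": kept_visual,
            "audio": kept_audio,
            "overall": (kept_visual + kept_audio) / (num_visual + num_audio),
        })
    return rows


if __name__ == "__main__":
    # Hypothetical input sizes and sweep settings.
    for row in token_budgets(2000, 500, [(0.1, 0.1), (0.2, 0.05)]):
        print(row)
```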

Baselines

We also provide the following scripts to evaluate the baseline methods adapted for omni-modal LLMs:

bash scripts/eval_qwen2_5_omni_random.sh         # Random
bash scripts/eval_qwen2_5_omni_fastv.sh          # FastV
bash scripts/eval_qwen2_5_omni_fastv_omni.sh     # FastV-om
bash scripts/eval_qwen2_5_omni_visionzip.sh      # VisionZip
bash scripts/eval_qwen2_5_omni_visionzip_omni.sh # VisionZip-om

... # more to be added

📁 Repo Structure

SEATS/
├── scripts/                          # Shell entry points (one per method) + shared base
│   ├── base/
│   │   ├── setup.sh                  # Python dependency installation
│   │   └── eval_qwen2_5_omni_zip.sh  # Shared accelerate + lmms-eval launcher
│   ├── eval_qwen2_5_omni_seats.sh    # SEATS (our method)
│   └── ...
├── seats/                            # SEATS three-stage implementation
│   ├── pre_llm_units.py              # Stage I: winDivPrune
│   ├── inner_llm_units.py            # Stage II: inner-LLM stage-adaptive selection
│   ├── ratio_decay_scheduler.py      # block-wise TRR decay schedule
│   ├── modeling_qwen2_5_omni_seats.py # patched Thinker / TextModel forwards
│   └── config.yaml                   # SEATS hyperparameters
├── baselines/                        # Per-method patches; one subfolder per baseline
│   ├── utils.py                      # apply_zip_method_patch() dispatcher
│   ├── full_tokens/                  # No compression (config only)
│   ├── visionzip_omni/               # VisionZip adapted for omni-modal
│   └── ...
├── models/qwen2_5_omni/              # Vendored Qwen2.5-Omni model code
└── lmms-eval/                        # Vendored LMMs-Eval (registers `qwen2_5_omni_zip`)
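The layout above suggests a single entry point that selects a per-method patch by name. A registry-style dispatcher of that kind could look like the sketch below. Every name here (`register_method`, `apply_method_patch`, the `compression` attribute) is hypothetical and is not the signature of `apply_zip_method_patch()` in baselines/utils.py.

```python
from typing import Callable, Dict

# Method name -> function that patches a model in place and returns it.
_PATCHES: Dict[str, Callable[[object, dict], object]] = {}


def register_method(name):
    """Decorator that records a patch function under a method name."""
    def decorator(fn):
        _PATCHES[name] = fn
        return fn
    return decorator


@register_method("full_tokens")
def _no_patch(model, config):
    # Baseline without compression: leave the model untouched.
    return model


@register_method("random")
def _random_patch(model, config):
    # Placeholder: a real patch would replace forward methods here.
    model.compression = {"method": "random", **config}
    return model


def apply_method_patch(model, method, config=None):
    """Look up the patch for `method` and apply it to `model`."""
    try:
        patch = _PATCHES[method]
    except KeyError:
        known = ", ".join(sorted(_PATCHES))
        raise ValueError(f"Unknown method {method!r}; choose from: {known}") from None
    return patch(model, config or {})


if __name__ == "__main__":
    class DummyModel:
        pass

    patched = apply_method_patch(DummyModel(), "random", {"ratio": 0.1})
    print(patched.compression)
```

Keeping each method behind one name lets a single model wrapper serve every baseline, with the shell scripts differing only in the method name and config they pass.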

🤝 Acknowledgement

This implementation relies on resources from Qwen2.5-Omni, Qwen3-Omni, LMMs-Eval, OmniZip, VisionZip, and DivPrune. We thank the original authors for their excellent contributions and for making their work publicly available.

✏️ Citation

If you find this work useful, please consider citing:

@article{xin2026seats,
  title={Stage-adaptive Token Selection for Efficient Omni-modal LLMs},
  author={Xin, Zijie and Yang, Jie and Zhao, Ruixiang and Wang, Tianyi and Rao, Fengyun and Lyu, Jing and Li, Xirong},
  journal={arXiv preprint arXiv:2605.20035},
  year={2026}
}

📜 License

This project is licensed under the MIT License. For commercial licensing or any use beyond research, please contact the authors.

📬 Contact for Issues

For any questions about this project (e.g., corrupted files or loading errors), please reach out at: xinzijie@ruc.edu.cn

About

This repo is the official implementation of "Stage-adaptive Token Selection for Efficient Omni-modal LLMs"
