Gumpest/SEED

SEED: Targeted Data Selection by Weighted Independent Set

Yuan Zhang1, Lifeng Guo2, Junwen Pan3, Wenzhao Zheng4,

Wen Zhou5, Kuan Cheng1, Kurt Keutzer4, Shanghang Zhang1✉️

1School of Computer Science, Peking University

2Beijing University of Posts and Telecommunications, 3Tianjin University

4EECS, UC Berkeley, 5Chinese Academy of Sciences

📜 News

🔥 [2026/05/18] We released SEED, and its code is now open source!

👀 Overview

Overview of SEED. SEED formulates subset selection as a Weighted Independent Set problem over a similarity graph constructed from training data, with better node weights from a mutual influence subspace and better edges from local scale normalization. The resulting structurally balanced graph enables selecting a compact, diverse, and high-influence subset. Different colors indicate that nodes belong to different domains, while the color intensity represents the node weights.

image
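
To make the Weighted Independent Set idea above concrete, here is a minimal sketch, not the SEED implementation. It assumes node weights are already given (SEED derives them from a mutual influence subspace), builds edges with a plain cosine-similarity threshold instead of SEED's local scale normalization, and uses a simple greedy heuristic. The function names `similarity_graph` and `greedy_weighted_independent_set` are hypothetical.

```python
import numpy as np


def similarity_graph(features, threshold):
    """Connect two samples when their cosine similarity exceeds `threshold`.

    Assumption: a fixed global threshold, used here only for illustration.
    """
    features = np.asarray(features, dtype=float)
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    adjacency = (unit @ unit.T) > threshold
    np.fill_diagonal(adjacency, False)  # no self-loops
    return adjacency


def greedy_weighted_independent_set(weights, adjacency, budget):
    """Greedy heuristic: take the heaviest available node, then drop its neighbors."""
    weights = np.asarray(weights, dtype=float)
    adjacency = np.asarray(adjacency, dtype=bool)
    available = np.ones(len(weights), dtype=bool)
    selected = []
    # Visit nodes from heaviest to lightest.
    for node in np.argsort(-weights, kind="stable"):
        if len(selected) >= budget:
            break
        if not available[node]:
            continue
        selected.append(int(node))
        available[node] = False
        available[adjacency[node]] = False  # neighbors are near-duplicates
    return selected


# Toy data: samples 0/1 are near-duplicates, and so are samples 2/3.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
weights = [0.9, 0.8, 0.5, 0.7]
graph = similarity_graph(features, threshold=0.9)
print(greedy_weighted_independent_set(weights, graph, budget=2))  # [0, 3]
```

In this toy case the second-heaviest sample (index 1) is skipped because it is adjacent to the already selected sample 0, so the budget goes to a sample from the other cluster instead.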

👨‍💻 Preparation

  1. Clone this repository and navigate to the SEED folder
git clone https://github.com/Gumpest/SEED.git
cd SEED
  2. Install the necessary packages
conda create -n seed python=3.10 -y
conda activate seed

pip install torch==2.1.2 torchvision torchaudio
pip install -r requirement.txt
  3. Install SEED
pip install -e .
  4. Prepare Training and Target Data

    4.1 Instruction Tuning

    • Training datasets: Flan v2, COT, Dolly, and Open Assistant.

    • Target datasets: MMLU, Tydiqa, and BBH.

    • A processed version of these files is available on Google Drive.

    4.2 Visual Instruction Tuning

    • Training datasets: Honeybee-Remake-SEED-200K, available on Hugging Face.

    • Target datasets: a random 5% sample of the benchmark datasets.
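
If you need to build the 5% target split yourself, a reproducible random sample could look like the sketch below. This is an illustration only: the helper name `sample_fraction`, the fixed seed, and the in-memory list of examples are assumptions, not part of the SEED code base.

```python
import random


def sample_fraction(examples, fraction=0.05, seed=0):
    """Return a reproducible random subset holding roughly `fraction` of `examples`."""
    if not examples:
        return []
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    k = max(1, round(len(examples) * fraction))
    # Sample indices, then sort them to preserve the original ordering.
    picked = sorted(rng.sample(range(len(examples)), k))
    return [examples[i] for i in picked]


# Hypothetical benchmark with 200 examples: 5% keeps 10 of them.
benchmark = [{"id": i} for i in range(200)]
target_split = sample_fraction(benchmark)
print(len(target_split))  # 10
```

Fixing the seed keeps the target split stable across runs, which makes selection results comparable.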

🎯 Quick Start

We provide a complete example pipeline (LLaMA3-8B) for the instruction tuning task, covering data selection, model training, and evaluation. All commands are organized as shell scripts for easy reproduction and one-command execution. Please remember to replace the default paths with your own local paths before running the scripts.

Data Selection with SEED

  1. Warmup training (5% random data)
bash shell/1_warmup.sh
  2. Collect the training gradient datastore
bash shell/2_gradient_train.sh

Note

Gradient collection must be run separately for each dataset: switch the corresponding commented-out lines in the script and rerun it, four times in total.

  3. Collect the target gradient datastore
bash shell/3_gradient_val.sh
  4. Select data with SEED
bash shell/4_select.sh
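
The selection steps above must run in order, since each stage consumes the output of the previous one. A small driver such as the following sketch can print or execute them in sequence. The helper names `build_commands` and `run_pipeline` are hypothetical, and the driver does not automate the per-dataset comment switching that `shell/2_gradient_train.sh` requires, so that step still needs manual reruns.

```python
import subprocess

# Script paths as listed in this README, in execution order.
SELECTION_STEPS = [
    "shell/1_warmup.sh",
    "shell/2_gradient_train.sh",  # rerun manually for each dataset (see Note)
    "shell/3_gradient_val.sh",
    "shell/4_select.sh",
]


def build_commands(steps=SELECTION_STEPS):
    """Turn script paths into argument lists suitable for subprocess.run."""
    return [["bash", step] for step in steps]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError, stopping at the first failure.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline(build_commands(), dry_run=True)
```

The default `dry_run=True` only prints the commands, which is a convenient way to confirm paths before launching long GPU jobs.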

Training

  1. Train the model with selected data
bash shell/5_train.sh

Evaluation

  1. Evaluate the model
bash evaluation/batch_eval.sh
  2. Print your results
python evaluation/print_res.py

The results are shown as follows:

================== Summary Table ==================

Task       | Checkpoint   | Score   
------------------------------------
tydiqa     | 211          | 57.5664 
mmlu       | 211          | 0.6513  
bbh        | 317          | 0.6676  

==================================================

License

This project is released under the Apache 2.0 license.

Citation

If you use SEED in your research, please cite our work by using the following BibTeX entry:

@article{zhang2026seed,
  title={SEED: Targeted Data Selection by Weighted Independent Set},
  author={Zhang, Yuan and Guo, Lifeng and Pan, Junwen and Liu, Chang and Zheng, Wenzhao and Cheng, Kuan and Keutzer, Kurt and Zhang, Shanghang},
  journal={arXiv preprint arXiv:2605.15691},
  year={2026}
}

Acknowledgment

We extend our gratitude to the open-source efforts of LESS, FAISS, and HoneyBee.

About

Official implementation of the paper "SEED: Targeted Data Selection by Weighted Independent Set".
