TreeFlash: Parallel AR-Approximation for Faster Speculative Decoding

Peer Rheinboldt · Frédéric Berdoz · Roger Wattenhofer

Preprint, submitted June 2026

Video: sidebyside_sd.mp4

Quick Start

TreeFlash requires trust_remote_code=True because the drafter architecture and spec_generate method are provided by this repository.

from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

drafter = AutoModel.from_pretrained(
    "peerrh/treeflash-qwen3-4b",
    trust_remote_code=True,
    dtype="bfloat16",
    device_map="cuda:0",
).eval()

target = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B",
    trust_remote_code=True,
    dtype="bfloat16",
    device_map="cuda:0",
).eval()

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B", trust_remote_code=True)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer([text], return_tensors="pt").to(drafter.device)

output_ids = drafter.spec_generate(
    target=target,
    input_ids=inputs["input_ids"],
    max_new_tokens=2048,
    stop_token_ids=[tokenizer.eos_token_id],
    temperature=0.0,
    drafter_temperature=1.0,
    tree_size=64,
    top_m=16,
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Supported Models

Target	Drafter
Qwen/Qwen3-4B	peerrh/treeflash-qwen3-4b
Qwen/Qwen3-8B	peerrh/treeflash-qwen3-8b
Qwen/Qwen3-Coder-30B-A3B-Instruct	peerrh/treeflash-qwen3-coder-30b-a3b
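
Each drafter in the table is paired with one target, so it can help to resolve the drafter from the target name before loading either model. The sketch below is only an illustration: `DRAFTER_FOR_TARGET` and `drafter_for` are hypothetical names that are not part of this repository, and the mapping simply copies the rows above.

```python
# Target -> drafter pairs copied from the Supported Models table.
# This dictionary is an illustrative assumption, not an API of this repository.
DRAFTER_FOR_TARGET = {
    "Qwen/Qwen3-4B": "peerrh/treeflash-qwen3-4b",
    "Qwen/Qwen3-8B": "peerrh/treeflash-qwen3-8b",
    "Qwen/Qwen3-Coder-30B-A3B-Instruct": "peerrh/treeflash-qwen3-coder-30b-a3b",
}


def drafter_for(target_id: str) -> str:
    """Hypothetical helper: return the drafter checkpoint listed for a target."""
    # Match case-insensitively so "qwen/qwen3-4b" resolves like "Qwen/Qwen3-4B".
    by_lower = {target.lower(): drafter for target, drafter in DRAFTER_FOR_TARGET.items()}
    try:
        return by_lower[target_id.lower()]
    except KeyError:
        raise ValueError(f"No TreeFlash drafter listed for target {target_id!r}") from None
```

The returned string can be passed to `AutoModel.from_pretrained` as in the Quick Start. Raising an error for unlisted targets avoids silently pairing a drafter with a target that is not in the table.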

Citation

If you use TreeFlash, please cite:

@article{rheinboldt2026treeflash,
  title={TreeFlash: Parallel AR-Approximation for Faster Speculative Decoding},
  author={Rheinboldt, Peer and Berdoz, Fr{\'e}d{\'e}ric and Wattenhofer, Roger},
  journal={arXiv preprint arXiv:2606.03819},
  year={2026}
}
