LongBench-v2 sampled datasets for `vllm bench serve` / `vllm-moreh bench serve`

Real long-context prompts sampled from LongBench-v2 (/remote/vast0/share-mv/zai-org/LongBench-v2/data.json) for serving benchmarks.

Location: /remote/vast0/share-mv/longbenchv2-custom/
Generated by: sample_longbench_v2.py (in this dir).

Files

File	Target ISL (tokens)	Prompts	Recommended OSL
`longbenchv2-8k.jsonl`	8,192	256	1024
`longbenchv2-10k.jsonl`	10,000	256	500
`longbenchv2-100k.jsonl`	100,000	100	500
`longbenchv2-1M.jsonl`	1,000,000	22	500

longbenchv2-manifest.json records the achieved token range per file.

The 1M file has only 22 prompts: these are all the LongBench-v2 entries whose real context reaches 1,000,000 GLM-5 tokens (no synthetic padding or repetition). Keep --num-prompts <= 22 to use only unique prompts; a larger value makes the benchmark oversample (repeat) them.

Format (vLLM `custom` dataset)

JSONL, one request per line. Only prompt is read by the benchmark; the rest is provenance:

{"prompt": "...", "input_tokens": 8192, "target_isl": 8192,
 "_id": "...", "domain": "...", "sub_domain": "...", "difficulty": "...",
 "source_length": "long", "source_words": 232975}

Tokenizer: GLM-5 (/remote/vast0/share-mv/zai-org/GLM-5-FP8/tokenizer.json).
ISL is controlled by the dataset. Each prompt tokenizes to its target ISL (exact for 8k/10k, off by at most 1 token for 100k/1M at unavoidable BPE boundaries) under the GLM-5 tokenizer with no special tokens, which is exactly what the benchmark measures when --skip-chat-template is set.
OSL is NOT in the dataset. Set it at serve time with --custom-output-len.
Prompt body uses the official LongBench-v2 0-shot template (instruction + context + question + 4 choices); the context is head-truncated to hit the target ISL.
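The record layout above can be sanity-checked without a tokenizer by reading the provenance fields directly. The following is a minimal sketch, not the logic of verify_dataset.py: the function name `summarize_jsonl` and the `tolerance` parameter are hypothetical, and it trusts the stored `input_tokens` value instead of re-tokenizing the prompt.

```python
import json
from pathlib import Path


def summarize_jsonl(path, tolerance=1):
    """Summarize the recorded token counts in one dataset file.

    Only the provenance fields stored in each record are read; the prompt
    itself is not re-tokenized. `tolerance` is an assumed allowance that
    mirrors the off-by-at-most-1 note for the 100k/1M files.
    """
    count = 0
    min_tokens = None
    max_tokens = None
    off_target = []  # _id values whose recorded length misses the target
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate a trailing blank line
            record = json.loads(line)
            tokens = record["input_tokens"]
            count += 1
            min_tokens = tokens if min_tokens is None else min(min_tokens, tokens)
            max_tokens = tokens if max_tokens is None else max(max_tokens, tokens)
            if abs(tokens - record["target_isl"]) > tolerance:
                off_target.append(record.get("_id"))
    return {
        "prompts": count,
        "min_tokens": min_tokens,
        "max_tokens": max_tokens,
        "off_target": off_target,
    }
```

Calling `summarize_jsonl("/remote/vast0/share-mv/longbenchv2-custom/longbenchv2-8k.jsonl")` should report the prompt count and token range for that file; an empty `off_target` list means every recorded length is within the tolerance of its target.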

Quick guide

Both tools take the same dataset flags (vllm-moreh vendors vLLM's dataset loader and adds a few Moreh-only options). For every config you only change two things: --dataset-path (which ISL file) and --custom-output-len (the OSL). Keep --num-prompts within the prompt count of the chosen file.

Dataset	`--custom-output-len`	`--num-prompts`
`longbenchv2-8k.jsonl`	1024	≤ 256
`longbenchv2-10k.jsonl`	500	≤ 256
`longbenchv2-100k.jsonl`	500	≤ 100
`longbenchv2-1M.jsonl`	500	≤ 22

Three flags are required to get the intended behavior: --skip-chat-template (so prompt_len == ISL), --custom-output-len <OSL> (sets OSL), and --ignore-eos (forces the model to generate the full OSL).

A) `vllm bench serve`

DATA=/remote/vast0/share-mv/longbenchv2-custom
TOK=/remote/vast0/share-mv/zai-org/GLM-5-FP8

vllm bench serve \
    --backend vllm \
    --dataset-name custom \
    --dataset-path $DATA/longbenchv2-8k.jsonl \
    --skip-chat-template \
    --custom-output-len 1024 \
    --ignore-eos \
    --tokenizer $TOK \
    --model <served-model> \
    --base-url http://<host>:<port> --endpoint /v1/completions \
    --num-prompts 256 --max-concurrency <C>

B) `vllm-moreh bench serve`

Same flags; just swap the command. vllm-moreh adds Moreh-only options (--num-warmups, multi-value --base-url/--host/--port for PD-disaggregated setups, --profile).

DATA=/remote/vast0/share-mv/longbenchv2-custom
TOK=/remote/vast0/share-mv/zai-org/GLM-5-FP8

vllm-moreh bench serve \
    --backend vllm \
    --dataset-name custom \
    --dataset-path $DATA/longbenchv2-100k.jsonl \
    --skip-chat-template \
    --custom-output-len 500 \
    --ignore-eos \
    --tokenizer $TOK \
    --model <served-model> \
    --base-url http://<host>:<port> --endpoint /v1/completions \
    --num-prompts 100 --max-concurrency <C> \
    --num-warmups 8                      # Moreh-only: warmup requests before measuring
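Because the two invocations above differ only in the tool name, the dataset file, and the OSL, the argument list can be generated from the tables in this README. This is a sketch under stated assumptions: `build_command` and `CONFIGS` are hypothetical names, and raising an error for an oversized --num-prompts is this helper's own policy (the benchmark itself oversamples instead of failing).

```python
import shlex

DATA = "/remote/vast0/share-mv/longbenchv2-custom"
TOK = "/remote/vast0/share-mv/zai-org/GLM-5-FP8"

# size -> (dataset file, recommended OSL, number of unique prompts),
# copied from the Files and Quick guide tables.
CONFIGS = {
    "8k": ("longbenchv2-8k.jsonl", 1024, 256),
    "10k": ("longbenchv2-10k.jsonl", 500, 256),
    "100k": ("longbenchv2-100k.jsonl", 500, 100),
    "1M": ("longbenchv2-1M.jsonl", 500, 22),
}


def build_command(size, model, base_url, max_concurrency,
                  tool="vllm", num_prompts=None):
    """Return the argv list for `<tool> bench serve` on one dataset size."""
    filename, osl, unique = CONFIGS[size]
    if num_prompts is None:
        num_prompts = unique
    if num_prompts > unique:
        # The benchmark would repeat prompts here; this helper refuses
        # so that repeated prompts are never used by accident.
        raise ValueError(f"{filename} has only {unique} unique prompts")
    return [
        tool, "bench", "serve",
        "--backend", "vllm",
        "--dataset-name", "custom",
        "--dataset-path", f"{DATA}/{filename}",
        "--skip-chat-template",
        "--custom-output-len", str(osl),
        "--ignore-eos",
        "--tokenizer", TOK,
        "--model", model,
        "--base-url", base_url,
        "--endpoint", "/v1/completions",
        "--num-prompts", str(num_prompts),
        "--max-concurrency", str(max_concurrency),
    ]


if __name__ == "__main__":
    # Placeholders are kept as-is; substitute real values before running.
    print(shlex.join(build_command("8k", "<served-model>",
                                   "http://<host>:<port>", 8)))
```

Pass `tool="vllm-moreh"` to target the Moreh CLI; Moreh-only options such as --num-warmups can be appended to the returned list.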

Notes / gotchas

--skip-chat-template matters. Without it, the prompt is wrapped in the served model's chat template, adding a handful of tokens, so prompt_len becomes ISL + template_overhead. Use /v1/completions + --skip-chat-template for exact ISL.
Tokenizer must match for exact ISL. ISL was measured with the GLM-5 tokenizer; benchmarking a model with a different tokenizer shifts the real prompt_len. Pass a matching --tokenizer, or regenerate with that model's tokenizer (below).
1M needs context room on the server: launch with --max-model-len of at least 1,000,500 (ISL 1,000,000 + OSL 500) plus a small margin, and keep --num-prompts ≤ 22.
Why custom, not sharegpt: ShareGPTDataset hard-filters prompts to ≤1024 tokens, silently dropping every long-context sample. CustomDataset has no length filter.
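The context-room note for the 1M file is plain arithmetic, and the same sum applies to the smaller files. A small sketch follows; the names `min_max_model_len` and `context_budget` are hypothetical, and the size of the margin is left to the caller since this README does not fix one.

```python
# file -> (target ISL, recommended OSL), copied from the Files table.
SIZES = {
    "longbenchv2-8k.jsonl": (8192, 1024),
    "longbenchv2-10k.jsonl": (10000, 500),
    "longbenchv2-100k.jsonl": (100000, 500),
    "longbenchv2-1M.jsonl": (1000000, 500),
}


def min_max_model_len(isl, osl, margin=0):
    """Smallest --max-model-len that fits one request: ISL + OSL + margin."""
    return isl + osl + margin


def context_budget(margin=0):
    """Map each dataset file to the minimum --max-model-len it needs."""
    return {
        name: min_max_model_len(isl, osl, margin)
        for name, (isl, osl) in SIZES.items()
    }


if __name__ == "__main__":
    for name, needed in context_budget().items():
        print(f"{name}: --max-model-len >= {needed}")
```

With `margin=0` the 1M file needs 1,000,500 tokens of context. Since the 100k/1M prompts can differ from their target ISL by one token, a margin of at least 1 is a reasonable floor.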

Verify (no GPU needed)

Loads each file through the real CustomDataset and checks prompt_len == target ISL:

python3 /remote/vast0/share-mv/longbenchv2-custom/verify_dataset.py

Regenerate (different tokenizer / sizes)

python3 /remote/vast0/share-mv/longbenchv2-custom/sample_longbench_v2.py \
    --tokenizer /path/to/model_or_tokenizer.json \
    --output-dir /remote/vast0/share-mv/longbenchv2-custom
