moreh-dev/longbenchv2-custom

LongBench-v2 sampled datasets for vllm bench serve / vllm-moreh bench serve

Real long-context prompts sampled from LongBench-v2 (/remote/vast0/share-mv/zai-org/LongBench-v2/data.json) for serving benchmarks.

Location: /remote/vast0/share-mv/longbenchv2-custom/
Generated by: sample_longbench_v2.py (in this directory).

Files

File                      Target ISL (tokens)   Prompts   Recommended OSL
longbenchv2-8k.jsonl      8,192                 256       1024
longbenchv2-10k.jsonl     10,000                256       500
longbenchv2-100k.jsonl    100,000               100       500
longbenchv2-1M.jsonl      1,000,000             22        500

longbenchv2-manifest.json records the achieved token range per file.

The 1M file has only 22 prompts: these are all the LongBench-v2 entries whose real context reaches 1,000,000 GLM-5 tokens, with no synthetic padding or repetition. Keep --num-prompts <= 22 to use only unique prompts; a larger value makes the benchmark oversample (repeat) them.

Format (vLLM custom dataset)

JSONL, one request per line. Only prompt is read by the benchmark; the rest is provenance:

{"prompt": "...", "input_tokens": 8192, "target_isl": 8192,
 "_id": "...", "domain": "...", "sub_domain": "...", "difficulty": "...",
 "source_length": "long", "source_words": 232975}
  • Tokenizer: GLM-5 (/remote/vast0/share-mv/zai-org/GLM-5-FP8/tokenizer.json).
  • ISL is controlled by the dataset. Each prompt tokenizes to its target ISL under the GLM-5 tokenizer with no special tokens added: exactly for 8k/10k, and within 1 token for 100k/1M, where a BPE boundary can make an exact cut impossible. This is the length the benchmark measures when --skip-chat-template is set.
  • OSL is NOT in the dataset. Set it at serve time with --custom-output-len.
  • Prompt body uses the official LongBench-v2 0-shot template (instruction + context + question + 4 choices); the context is head-truncated to hit the target ISL.
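To inspect a file without loading a tokenizer, the provenance fields can be read directly. The following is a minimal sketch using only the standard library; the helper name summarize_jsonl is hypothetical, and it trusts the recorded input_tokens and target_isl values rather than re-tokenizing the prompt.

```python
import json
from collections import Counter


def summarize_jsonl(path):
    """Summarize the recorded token counts in one dataset file.

    Relies on the `input_tokens` and `target_isl` fields written alongside
    each prompt; the prompt text itself is not re-tokenized here.
    """
    deltas = Counter()  # (input_tokens - target_isl) -> number of prompts
    prompts = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            rec = json.loads(line)
            deltas[rec["input_tokens"] - rec["target_isl"]] += 1
            prompts += 1
    return {"prompts": prompts, "delta_counts": dict(deltas)}
```

For the 8k and 10k files, every prompt would be expected to land in the 0 bucket of delta_counts; the 100k and 1M files may also show a few entries one token away from the target.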

Quick guide

Both tools take the same dataset flags (vllm-moreh vendors vLLM's dataset loader and adds a few Moreh-only options). Between configs you change only two things: --dataset-path (which ISL file) and --custom-output-len (the OSL); --num-prompts just has to stay within the file's prompt count.

Dataset                   --custom-output-len   --num-prompts
longbenchv2-8k.jsonl      1024                  ≤ 256
longbenchv2-10k.jsonl     500                   ≤ 256
longbenchv2-100k.jsonl    500                   ≤ 100
longbenchv2-1M.jsonl      500                   ≤ 22

Three flags are required to get the intended behavior: --skip-chat-template (so prompt_len == ISL), --custom-output-len <OSL> (sets OSL), and --ignore-eos (forces the model to generate the full OSL).
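When sweeping over several files, the dataset-specific part of the command can be generated from the table above. This is a rough sketch, not part of either tool: CONFIGS and bench_args are hypothetical names, the values are copied from the table, and the result deliberately omits --backend, --model, --tokenizer, --base-url, --endpoint and --max-concurrency, which still have to be appended. Refusing a --num-prompts larger than the file is a choice made here; the benchmark itself would oversample instead.

```python
# (recommended OSL, number of unique prompts) per file, copied from the table.
CONFIGS = {
    "longbenchv2-8k.jsonl": (1024, 256),
    "longbenchv2-10k.jsonl": (500, 256),
    "longbenchv2-100k.jsonl": (500, 100),
    "longbenchv2-1M.jsonl": (500, 22),
}


def bench_args(data_dir, filename, num_prompts=None, tool="vllm"):
    """Build the dataset-related arguments for `<tool> bench serve`.

    `tool` is "vllm" or "vllm-moreh". Server-specific flags are not included.
    """
    osl, max_unique = CONFIGS[filename]
    n = max_unique if num_prompts is None else num_prompts
    if n > max_unique:
        # Design choice of this sketch: fail rather than repeat prompts.
        raise ValueError(f"{filename} has only {max_unique} unique prompts")
    return [
        tool, "bench", "serve",
        "--dataset-name", "custom",
        "--dataset-path", f"{data_dir}/{filename}",
        "--skip-chat-template",              # prompt_len == ISL
        "--custom-output-len", str(osl),     # sets OSL
        "--ignore-eos",                      # generate the full OSL
        "--num-prompts", str(n),
    ]
```

The returned list can be extended with the server flags and passed to subprocess.run, or joined with shlex.join to print a copy-pasteable command.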

A) vllm bench serve

DATA=/remote/vast0/share-mv/longbenchv2-custom
TOK=/remote/vast0/share-mv/zai-org/GLM-5-FP8

vllm bench serve \
    --backend vllm \
    --dataset-name custom \
    --dataset-path $DATA/longbenchv2-8k.jsonl \
    --skip-chat-template \
    --custom-output-len 1024 \
    --ignore-eos \
    --tokenizer $TOK \
    --model <served-model> \
    --base-url http://<host>:<port> --endpoint /v1/completions \
    --num-prompts 256 --max-concurrency <C>

B) vllm-moreh bench serve

Same flags; just swap the command. vllm-moreh adds Moreh-only options (--num-warmups, multi-value --base-url/--host/--port for PD-disaggregated setups, --profile).

DATA=/remote/vast0/share-mv/longbenchv2-custom
TOK=/remote/vast0/share-mv/zai-org/GLM-5-FP8

vllm-moreh bench serve \
    --backend vllm \
    --dataset-name custom \
    --dataset-path $DATA/longbenchv2-100k.jsonl \
    --skip-chat-template \
    --custom-output-len 500 \
    --ignore-eos \
    --tokenizer $TOK \
    --model <served-model> \
    --base-url http://<host>:<port> --endpoint /v1/completions \
    --num-prompts 100 --max-concurrency <C> \
    --num-warmups 8                      # Moreh-only: warmup requests before measuring

Notes / gotchas

  • --skip-chat-template matters. Without it, the prompt is wrapped in the served model's chat template, adding a handful of tokens, so prompt_len becomes ISL + template_overhead. Use /v1/completions + --skip-chat-template for exact ISL.
  • Tokenizer must match for exact ISL. ISL was measured with the GLM-5 tokenizer; benchmarking a model with a different tokenizer shifts the real prompt_len. Pass a matching --tokenizer, or regenerate with that model's tokenizer (below).
  • 1M needs context room on the server: launch with --max-model-len of at least 1,000,500 (ISL 1,000,000 + OSL 500) plus a small margin, and keep --num-prompts ≤ 22.
  • Why custom, not sharegpt: ShareGPTDataset hard-filters prompts to ≤1024 tokens, silently dropping every long-context sample. CustomDataset has no length filter.
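The context-length budget from the 1M note can be written out as arithmetic. The sketch below assumes the server needs prompt tokens plus generated tokens to fit within --max-model-len, and it allows one extra prompt token for the off-by-≤1 case mentioned under Format (assuming the deviation can go upward). The function name and the margin parameter are hypothetical; how much margin to add is a judgment call.

```python
def min_max_model_len(isl, osl, isl_slack=1, margin=0):
    """Smallest --max-model-len that fits one request, under the stated assumptions.

    isl_slack covers a prompt that tokenizes one token over its target ISL.
    margin is any extra headroom you choose to add on top.
    """
    return isl + isl_slack + osl + margin


# 1M file with the recommended OSL of 500:
# 1,000,000 + 1 + 500 = 1,000,501 before any extra margin.
required = min_max_model_len(1_000_000, 500)
```

The same call with the other files' ISL and OSL values gives a lower bound for smaller deployments, for example min_max_model_len(8192, 1024) for the 8k file.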

Verify (no GPU needed)

Loads each file through the real CustomDataset and checks prompt_len == target ISL:

python3 /remote/vast0/share-mv/longbenchv2-custom/verify_dataset.py
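The contents of verify_dataset.py are not shown here, so the following is an independent sketch of the same kind of check rather than that script. It takes the token counter as a parameter (check_prompt_lengths and count_tokens are hypothetical names), so it does not go through vLLM's CustomDataset.

```python
import json


def check_prompt_lengths(path, count_tokens, tolerance=0):
    """Return (line_number, actual_len, target_isl) for each out-of-range prompt.

    count_tokens: callable mapping a prompt string to its token count.
    tolerance:    allowed absolute deviation from target_isl, e.g. 1 for the
                  100k/1M files.
    """
    mismatches = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            rec = json.loads(line)
            actual = count_tokens(rec["prompt"])
            if abs(actual - rec["target_isl"]) > tolerance:
                mismatches.append((lineno, actual, rec["target_isl"]))
    return mismatches
```

With the Hugging Face tokenizers package installed, a counter matching the "no special tokens" convention could be built from tok = Tokenizer.from_file(".../tokenizer.json") as lambda s: len(tok.encode(s, add_special_tokens=False).ids). An empty result means every prompt is within tolerance of its target.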

Regenerate (different tokenizer / sizes)

python3 /remote/vast0/share-mv/longbenchv2-custom/sample_longbench_v2.py \
    --tokenizer /path/to/model_or_tokenizer.json \
    --output-dir /remote/vast0/share-mv/longbenchv2-custom
