KV-Cache eviction prioritization API by hyeongyun0916 · Pull Request #10 · moreh-dev/vllm

hyeongyun0916 · 2026-05-21T14:10:22Z

Purpose

This PR adds a minimal KV-cache eviction prioritization API to vLLM's chat completion endpoint, to support multi-turn agentic workloads where downstream orchestrators (NVIDIA Dynamo, llm-d) need explicit per-request retention hints.

The directive shape mirrors TensorRT-LLM's KvCacheRetentionConfig.TokenRangeRetentionConfig — same primitives: token range [start, end), priority (0-100), optional duration.
A single orchestrator can target both backends with the same payload semantics; only unit conversion (ms ↔ seconds) and one extra field (our retention_scope) differ.

The Dynamo blog uses example syntax matching this directly:

"system prompt blocks are evicted last (priority: 100); conversation context survives a 30-second tool call (duration: 45s); decode tokens are first to go (priority: 1)"

Implements RFC vllm-project#37003 ([RFC]: Context-Aware KV-Cache Retention API (Prioritized Evictions)).
This supersedes the existing draft PR vllm-project#38514 with the same public API, restructured to keep KVCacheBlock and Request completely untouched.
PR vllm-project#38514 will be closed once this lands; RFC vllm-project#37003 remains open for ongoing design discussion.

Public API

Two optional fields on ChatCompletionRequest:

retention_directives: list[dict] | None
# Each directive: {start: int, end: int|None, priority: int (0-100),
#                  duration: float|None}

retention_scope: str | None
# Opaque ownership identifier (e.g., session ID). Used to enforce
# directive-author ownership rules.

Forwarded to SamplingParams.extra_args and consumed by block_pool at the cache-full transition.
Validator enforces that priorities are non-increasing across rising token positions (prefix-cache constraint).

Behavior

Zero-overhead fast path: when no request has retention directives, get_new_blocks and get_num_free_blocks are bit-for-bit equivalent to the existing LRU path — a single integer comparison guards the branch.
Eviction order under pressure: unprioritized blocks drain from the LRU free list first; only after that, the lowest-priority blocks are popped from the priority queue.
Ownership: any scope can escalate priority; only the owning scope can downgrade or clear.
Anonymous (scope=None) callers cannot clear another scope's directive.
Expiry: duration translates into a monotonic expiry timestamp; expired entries are silently treated as unprioritized.

What's changed

The change is purely additive at the API surface and at the data model:

KVCacheBlock is not modified (0 changes to kv_cache_utils.py)
Request is not modified (0 changes to request.py)
All retention state lives in a new self-contained module vllm/v1/core/priority_eviction_queue.py
block_pool.py integrates the new module via 8 single-line hook calls (+71 lines total)

Two structural-invariant regression tests are included so the "no core changes" property is automatically enforced against future commits.

benchmarks/benchmark_retention_eviction.py            (new)  +233
tests/tool_use/test_chat_completion_request_validations.py   +85
tests/v1/core/test_priority_eviction.py               (new) +558
vllm/entrypoints/openai/chat_completion/protocol.py         +53
vllm/v1/core/block_pool.py                                  +71 / −7
vllm/v1/core/priority_eviction_queue.py               (new) +160

Compared to vllm-project#38514:

Metric	vllm-project#38514	This PR
Production lines added	+380	+284
Core files modified	4	2
`block_pool.py` inline	+174	+71
`KVCacheBlock` changes	+17	0
`Request` changes	+10	0

References

TensorRT-LLM KvCacheRetentionConfig API: https://nvidia.github.io/TensorRT-LLM/latest/features/kvcache.html
NVIDIA Dynamo, full-stack agentic inference optimizations: https://developer.nvidia.com/blog/full-stack-optimizations-for-agentic-inference-with-nvidia-dynamo/
Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live — https://arxiv.org/abs/2511.02230
Implements RFC [RFC]: Context-Aware KV-Cache Retention API (Prioritized Evictions) vllm-project/vllm#37003. Supersedes draft PR [Feature] Context-Aware KV-Cache Retention API (#37003) vllm-project/vllm#38514 (same public API, minimized surface).

Test Plan

# Retention unit + integration tests
pytest tests/v1/core/test_priority_eviction.py

# Chat completion validator tests (retention fields + monotonic validator)
pytest tests/tool_use/test_chat_completion_request_validations.py

# End-to-end benchmark (no GPU required, runs against real BlockPool)
python benchmarks/benchmark_retention_eviction.py \
  --num-gpu-blocks 64 --num-sessions 4 --blocks-per-session 4

# Pre-commit on touched files
pre-commit run --files \
  vllm/v1/core/priority_eviction_queue.py \
  vllm/v1/core/block_pool.py \
  vllm/entrypoints/openai/chat_completion/protocol.py \
  tests/v1/core/test_priority_eviction.py \
  tests/tool_use/test_chat_completion_request_validations.py \
  benchmarks/benchmark_retention_eviction.py

# Strict mypy
pre-commit run mypy-3.10 --files \
  vllm/v1/core/priority_eviction_queue.py \
  vllm/v1/core/block_pool.py \
  vllm/entrypoints/openai/chat_completion/protocol.py \
  --hook-stage manual

# Broader regression
pytest tests/v1/core/

Test Result

tests/v1/core/test_priority_eviction.py — 40/40 pass
- TestPriorityEvictionQueue (10): heap operations, lazy delete, TTL
- TestApplyDirectives (11): overlap matching, ownership rules
- TestSidecarLifecycle (6): cleanup invariants
- TestBlockPoolPriorityEviction (7): integration tests
- TestStructuralInvariants (2): KVCacheBlock / Request untouched
tests/tool_use/test_chat_completion_request_validations.py — 10/10 pass (3 pre-existing + 7 new for retention)
benchmarks/benchmark_retention_eviction.py — end-to-end PASS (prioritized blocks survive; unprioritized blocks evicted)
pre-commit run --files <all touched> — clean
pre-commit run mypy-3.10 --hook-stage manual — clean
Broader pytest tests/v1/core/ — 305 pass; remaining 32 fail / 2 error reproduce identically on the PR base (verified in a separate worktree).
Unrelated to this PR — test_scheduler.py / test_scheduler_e2e.py EC-connector and multimodal tests that depend on a model artifact not available in the CI environment.

Performance benchmark numbers will be added separately (re-running to ensure reproducibility).

Essential Elements of an Effective PR Description Checklist

The purpose of the PR — see Purpose section above. Implements RFC [RFC]: Context-Aware KV-Cache Retention API (Prioritized Evictions) vllm-project/vllm#37003, supersedes draft PR [Feature] Context-Aware KV-Cache Retention API (#37003) vllm-project/vllm#38514.
The test plan — see Test Plan section above.
The test results — see Test Result section above.
Performance numbers are being re-measured and will be added in a follow-up.
(Optional) The necessary documentation update — N/A (no new model or example).

Introduce a new self-contained module for priority-based KV-cache eviction. Per-block retention metadata is stored in a side-table keyed by block_id, keeping KVCacheBlock untouched. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

try_insert returns False for blocks without sidecar entries, enabling single-line routing at the caller. pop_lowest consumes the sidecar entry on eviction. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Lock in the heap-key ordering contract: (priority ASC, last_freed ASC, block_id ASC). Lowest priority leaves first; equal-priority blocks leave oldest-freed first. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

remove() drops from _in_queue without touching the heap; pop_lowest skips entries whose block_id is no longer in _in_queue. Sidecar metadata is preserved so priority survives a reuse cycle. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Entries whose monotonic expiry has elapsed are silently discarded by pop_lowest. Callers receive None when only expired entries remain; the block falls back to the LRU free list at the next allocation. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

For each full block, the highest-priority directive whose token range overlaps the block's range wins. Open-ended ranges (end=None) cover from start to end-of-sequence. Duration translates into a monotonic expiry timestamp. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Escalation is open to any caller; downgrade and refresh are restricted to the current owner; a no-match directive set from the owner's scope clears the entry. Non-owner downgrades and anonymous (scope=None) clears are silently ignored. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

clear_priority drops both sidecar and heap state for one block; clear drops everything; __contains__ reports whether a block is currently in the heap. Locks in the sidecar lifecycle invariants. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Add the queue as a member of BlockPool and clear it on reset_prefix_cache. Two single-line hooks; no changes to KVCacheBlock. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Reads retention_directives and retention_scope from extra_args and forwards them to the priority eviction queue. Returns early when neither is present, preserving the existing zero-overhead path. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Fast path is bit-for-bit equivalent to LRU when the priority queue is empty (single integer compare). Otherwise: drain LRU first, then pop lowest-priority blocks from the queue. get_num_free_blocks now sums both pools. reset_prefix_cache updated to use free_block_queue.num_free_blocks directly (pre-Task-13 guard against double-counting with sidecar). Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

…llocation The earlier priority-aware get_num_free_blocks rewrite (sum of LRU and priority queue) required a reset_prefix_cache change to keep an earlier eviction-queue test green. That side-effect is reverted here; the test is instead adjusted to remove the block from the LRU before adding it to the priority queue, matching the single-queue invariant enforced later when touch() is updated. Keeps reset_prefix_cache correct in production after full integration. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Single 3-line dispatch in touch(): if the reused block is in the priority queue, remove from there; otherwise remove from the LRU free list (existing behavior). Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Single-line filter: blocks for which try_insert returns False fall through to the LRU free list (existing path). try_insert stamps the current monotonic time so the heap tiebreak reflects this most-recent free. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Single-line hook in the evict_blocks per-block loop. Ensures the sidecar dict tracks only blocks that are currently in the prefix cache. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Two new Pydantic fields on ChatCompletionRequest, forwarded into SamplingParams.extra_args. Pure addition — no existing-line changes. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Across rising token positions, priorities must be non-increasing (prefix-cache constraint: cached prefixes are shared across requests, so later token spans cannot retain blocks that earlier spans do not). Validator runs before model construction; sorts by start, then checks. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Two regression tests prevent KVCacheBlock or Request from accidentally gaining retention-specific fields in future commits. They make the sidecar contract self-enforcing. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Adapt the existing benchmark to read priority via the queue API rather than KVCacheBlock fields. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

…ock generation counter PriorityEvictionQueue holds freed blocks in a min-heap keyed by (priority, last_freed_time). A remove()+re-free reuse cycle pushed a second tuple for the same block_id without removing the prior one, so the heap could carry stale tuples with an outdated (priority, last_freed_time). pop_lowest could then evict using a stale tuple — e.g. a block re-protected at priority 90 evicted as if still priority 50 — inverting the intended eviction order and silently violating retention directives. Fix: stamp every pushed tuple with a per-block monotonic generation (_gen[block_id], bumped on each try_insert) and have pop_lowest skip any popped tuple whose generation no longer matches the block's current generation (lazy deletion). Eviction now always orders by the block's CURRENT (priority, last_freed_time). Bundled (entangled in the same files): - drain_expired(): demote expired sidecar entries to the LRU free list rather than letting them top the next pop_lowest (was destroying prefix-cache hit rate under TTL'd protection). - try_insert routes below-threshold sidecar entries to the LRU instead of the priority queue. Tests: stale-tuple/inversion + threshold-routing cases in test_priority_eviction.py; conftest lowers _PRIORITY_THRESHOLD to 0 for the dir's priority=50 convention. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

hyeongyun0916 added 19 commits May 21, 2026 22:40

Retention API: implement try_insert and pop_lowest with min-heap

11b4836

try_insert returns False for blocks without sidecar entries, enabling single-line routing at the caller. pop_lowest consumes the sidecar entry on eviction. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Retention API: test eviction order by priority and freed-time tiebreak

5e6247d

Lock in the heap-key ordering contract: (priority ASC, last_freed ASC, block_id ASC). Lowest priority leaves first; equal-priority blocks leave oldest-freed first. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Retention API: wire PriorityEvictionQueue into BlockPool

55a1a39

Add the queue as a member of BlockPool and clear it on reset_prefix_cache. Two single-line hooks; no changes to KVCacheBlock. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Retention API: touch removes from priority queue when block was there

dbbeae2

Single 3-line dispatch in touch(): if the reused block is in the priority queue, remove from there; otherwise remove from the LRU free list (existing behavior). Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Retention API: clear sidecar on evict_blocks

b10c395

Single-line hook in the evict_blocks per-block loop. Ensures the sidecar dict tracks only blocks that are currently in the prefix cache. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Retention API: add retention_directives + retention_scope chat fields

4f93efb

Two new Pydantic fields on ChatCompletionRequest, forwarded into SamplingParams.extra_args. Pure addition — no existing-line changes. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Retention API: structural-invariant tests for sidecar pattern

e582d5f

Two regression tests prevent KVCacheBlock or Request from accidentally gaining retention-specific fields in future commits. They make the sidecar contract self-enforcing. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

Retention API: port throughput benchmark with sidecar access patterns

85a85f1

Adapt the existing benchmark to read priority via the queue API rather than KVCacheBlock fields. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

gitgod-bot assigned hyeongyun0916 May 21, 2026

hyeongyun0916 changed the title ~~Retention minimal~~ KV-Cache eviction prioritization API May 21, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KV-Cache eviction prioritization API#10

KV-Cache eviction prioritization API#10
hyeongyun0916 wants to merge 20 commits into
mainfrom
retention-minimal

hyeongyun0916 commented May 21, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hyeongyun0916 commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Public API

Behavior

What's changed

References

Test Plan

Test Result

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hyeongyun0916 commented May 21, 2026 •

edited

Loading