Skip to content

KV-Cache eviction prioritization API#10

Draft
hyeongyun0916 wants to merge 20 commits into
mainfrom
retention-minimal
Draft

KV-Cache eviction prioritization API#10
hyeongyun0916 wants to merge 20 commits into
mainfrom
retention-minimal

Conversation

@hyeongyun0916

@hyeongyun0916 hyeongyun0916 commented May 21, 2026

Copy link
Copy Markdown

Purpose

This PR adds a minimal KV-cache eviction prioritization API to vLLM's chat completion endpoint, to support multi-turn agentic workloads where downstream orchestrators (NVIDIA Dynamo, llm-d) need explicit per-request retention hints.

The directive shape mirrors TensorRT-LLM's KvCacheRetentionConfig.TokenRangeRetentionConfig — same primitives: token range [start, end), priority (0-100), optional duration.
A single orchestrator can target both backends with the same payload semantics; only unit conversion (ms ↔ seconds) and one extra field (our retention_scope) differ.

The Dynamo blog uses example syntax matching this directly:

"system prompt blocks are evicted last (priority: 100); conversation context survives a 30-second tool call (duration: 45s); decode tokens are first to go (priority: 1)"

Implements RFC vllm-project#37003 ([RFC]: Context-Aware KV-Cache Retention API (Prioritized Evictions)).
This supersedes the existing draft PR vllm-project#38514 with the same public API, restructured to keep KVCacheBlock and Request completely untouched.
PR vllm-project#38514 will be closed once this lands; RFC vllm-project#37003 remains open for ongoing design discussion.

Public API

Two optional fields on ChatCompletionRequest:

retention_directives: list[dict] | None
# Each directive: {start: int, end: int|None, priority: int (0-100),
#                  duration: float|None}

retention_scope: str | None
# Opaque ownership identifier (e.g., session ID). Used to enforce
# directive-author ownership rules.

Forwarded to SamplingParams.extra_args and consumed by block_pool at the cache-full transition.
Validator enforces that priorities are non-increasing across rising token positions (prefix-cache constraint).

Behavior

  • Zero-overhead fast path: when no request has retention directives, get_new_blocks and get_num_free_blocks are bit-for-bit equivalent to the existing LRU path — a single integer comparison guards the branch.
  • Eviction order under pressure: unprioritized blocks drain from the LRU free list first; only after that, the lowest-priority blocks are popped from the priority queue.
  • Ownership: any scope can escalate priority; only the owning scope can downgrade or clear.
    Anonymous (scope=None) callers cannot clear another scope's directive.
  • Expiry: duration translates into a monotonic expiry timestamp; expired entries are silently treated as unprioritized.

What's changed

The change is purely additive at the API surface and at the data model:

  • KVCacheBlock is not modified (0 changes to kv_cache_utils.py)
  • Request is not modified (0 changes to request.py)
  • All retention state lives in a new self-contained module vllm/v1/core/priority_eviction_queue.py
  • block_pool.py integrates the new module via 8 single-line hook calls (+71 lines total)

Two structural-invariant regression tests are included so the "no core changes" property is automatically enforced against future commits.

benchmarks/benchmark_retention_eviction.py            (new)  +233
tests/tool_use/test_chat_completion_request_validations.py   +85
tests/v1/core/test_priority_eviction.py               (new) +558
vllm/entrypoints/openai/chat_completion/protocol.py         +53
vllm/v1/core/block_pool.py                                  +71 / −7
vllm/v1/core/priority_eviction_queue.py               (new) +160

Compared to vllm-project#38514:

Metric vllm-project#38514 This PR
Production lines added +380 +284
Core files modified 4 2
block_pool.py inline +174 +71
KVCacheBlock changes +17 0
Request changes +10 0

References

Test Plan

# Retention unit + integration tests
pytest tests/v1/core/test_priority_eviction.py

# Chat completion validator tests (retention fields + monotonic validator)
pytest tests/tool_use/test_chat_completion_request_validations.py

# End-to-end benchmark (no GPU required, runs against real BlockPool)
python benchmarks/benchmark_retention_eviction.py \
  --num-gpu-blocks 64 --num-sessions 4 --blocks-per-session 4

# Pre-commit on touched files
pre-commit run --files \
  vllm/v1/core/priority_eviction_queue.py \
  vllm/v1/core/block_pool.py \
  vllm/entrypoints/openai/chat_completion/protocol.py \
  tests/v1/core/test_priority_eviction.py \
  tests/tool_use/test_chat_completion_request_validations.py \
  benchmarks/benchmark_retention_eviction.py

# Strict mypy
pre-commit run mypy-3.10 --files \
  vllm/v1/core/priority_eviction_queue.py \
  vllm/v1/core/block_pool.py \
  vllm/entrypoints/openai/chat_completion/protocol.py \
  --hook-stage manual

# Broader regression
pytest tests/v1/core/

Test Result

  • tests/v1/core/test_priority_eviction.py40/40 pass
    • TestPriorityEvictionQueue (10): heap operations, lazy delete, TTL
    • TestApplyDirectives (11): overlap matching, ownership rules
    • TestSidecarLifecycle (6): cleanup invariants
    • TestBlockPoolPriorityEviction (7): integration tests
    • TestStructuralInvariants (2): KVCacheBlock / Request untouched
  • tests/tool_use/test_chat_completion_request_validations.py10/10 pass (3 pre-existing + 7 new for retention)
  • benchmarks/benchmark_retention_eviction.py — end-to-end PASS (prioritized blocks survive; unprioritized blocks evicted)
  • pre-commit run --files <all touched>clean
  • pre-commit run mypy-3.10 --hook-stage manualclean
  • Broader pytest tests/v1/core/305 pass; remaining 32 fail / 2 error reproduce identically on the PR base (verified in a separate worktree).
    Unrelated to this PR — test_scheduler.py / test_scheduler_e2e.py EC-connector and multimodal tests that depend on a model artifact not available in the CI environment.

Performance benchmark numbers will be added separately (re-running to ensure reproducibility).

Essential Elements of an Effective PR Description Checklist

Introduce a new self-contained module for priority-based KV-cache eviction.
Per-block retention metadata is stored in a side-table keyed by block_id,
keeping KVCacheBlock untouched.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
try_insert returns False for blocks without sidecar entries, enabling
single-line routing at the caller. pop_lowest consumes the sidecar
entry on eviction.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Lock in the heap-key ordering contract: (priority ASC, last_freed ASC,
block_id ASC). Lowest priority leaves first; equal-priority blocks
leave oldest-freed first.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
remove() drops from _in_queue without touching the heap; pop_lowest
skips entries whose block_id is no longer in _in_queue. Sidecar
metadata is preserved so priority survives a reuse cycle.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Entries whose monotonic expiry has elapsed are silently discarded by
pop_lowest. Callers receive None when only expired entries remain;
the block falls back to the LRU free list at the next allocation.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
For each full block, the highest-priority directive whose token range
overlaps the block's range wins. Open-ended ranges (end=None) cover
from start to end-of-sequence. Duration translates into a monotonic
expiry timestamp.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Escalation is open to any caller; downgrade and refresh are restricted
to the current owner; a no-match directive set from the owner's scope
clears the entry. Non-owner downgrades and anonymous (scope=None)
clears are silently ignored.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
clear_priority drops both sidecar and heap state for one block; clear
drops everything; __contains__ reports whether a block is currently
in the heap. Locks in the sidecar lifecycle invariants.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Add the queue as a member of BlockPool and clear it on
reset_prefix_cache. Two single-line hooks; no changes to KVCacheBlock.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Reads retention_directives and retention_scope from extra_args and
forwards them to the priority eviction queue. Returns early when
neither is present, preserving the existing zero-overhead path.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Fast path is bit-for-bit equivalent to LRU when the priority queue is
empty (single integer compare). Otherwise: drain LRU first, then pop
lowest-priority blocks from the queue. get_num_free_blocks now sums
both pools. reset_prefix_cache updated to use free_block_queue.num_free_blocks
directly (pre-Task-13 guard against double-counting with sidecar).

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
…llocation

The earlier priority-aware get_num_free_blocks rewrite (sum of LRU
and priority queue) required a reset_prefix_cache change to keep an
earlier eviction-queue test green. That side-effect is reverted here;
the test is instead adjusted to remove the block from the LRU before
adding it to the priority queue, matching the single-queue invariant
enforced later when touch() is updated. Keeps reset_prefix_cache
correct in production after full integration.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Single 3-line dispatch in touch(): if the reused block is in the
priority queue, remove from there; otherwise remove from the LRU free
list (existing behavior).

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Single-line filter: blocks for which try_insert returns False fall
through to the LRU free list (existing path). try_insert stamps the
current monotonic time so the heap tiebreak reflects this most-recent
free.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Single-line hook in the evict_blocks per-block loop. Ensures the
sidecar dict tracks only blocks that are currently in the prefix cache.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Two new Pydantic fields on ChatCompletionRequest, forwarded into
SamplingParams.extra_args. Pure addition — no existing-line changes.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Across rising token positions, priorities must be non-increasing
(prefix-cache constraint: cached prefixes are shared across requests,
so later token spans cannot retain blocks that earlier spans do not).
Validator runs before model construction; sorts by start, then checks.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Two regression tests prevent KVCacheBlock or Request from accidentally
gaining retention-specific fields in future commits. They make the
sidecar contract self-enforcing.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Adapt the existing benchmark to read priority via the queue API rather
than KVCacheBlock fields.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
@hyeongyun0916 hyeongyun0916 changed the title Retention minimal KV-Cache eviction prioritization API May 21, 2026
…ock generation counter

PriorityEvictionQueue holds freed blocks in a min-heap keyed by
(priority, last_freed_time). A remove()+re-free reuse cycle pushed a
second tuple for the same block_id without removing the prior one, so the
heap could carry stale tuples with an outdated (priority, last_freed_time).
pop_lowest could then evict using a stale tuple — e.g. a block re-protected
at priority 90 evicted as if still priority 50 — inverting the intended
eviction order and silently violating retention directives.
Fix: stamp every pushed tuple with a per-block monotonic generation
(_gen[block_id], bumped on each try_insert) and have pop_lowest skip any
popped tuple whose generation no longer matches the block's current
generation (lazy deletion). Eviction now always orders by the block's
CURRENT (priority, last_freed_time).
Bundled (entangled in the same files):
- drain_expired(): demote expired sidecar entries to the LRU free list
  rather than letting them top the next pop_lowest (was destroying
  prefix-cache hit rate under TTL'd protection).
- try_insert routes below-threshold sidecar entries to the LRU instead
  of the priority queue.
Tests: stale-tuple/inversion + threshold-routing cases in
test_priority_eviction.py; conftest lowers _PRIORITY_THRESHOLD to 0 for
the dir's priority=50 convention.

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant