KV-Cache eviction prioritization API#10
Draft
hyeongyun0916 wants to merge 20 commits into
Draft
Conversation
Introduce a new self-contained module for priority-based KV-cache eviction. Per-block retention metadata is stored in a side-table keyed by block_id, keeping KVCacheBlock untouched. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
try_insert returns False for blocks without sidecar entries, enabling single-line routing at the caller. pop_lowest consumes the sidecar entry on eviction. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Lock in the heap-key ordering contract: (priority ASC, last_freed ASC, block_id ASC). Lowest priority leaves first; equal-priority blocks leave oldest-freed first. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
remove() drops from _in_queue without touching the heap; pop_lowest skips entries whose block_id is no longer in _in_queue. Sidecar metadata is preserved so priority survives a reuse cycle. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Entries whose monotonic expiry has elapsed are silently discarded by pop_lowest. Callers receive None when only expired entries remain; the block falls back to the LRU free list at the next allocation. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
For each full block, the highest-priority directive whose token range overlaps the block's range wins. Open-ended ranges (end=None) cover from start to end-of-sequence. Duration translates into a monotonic expiry timestamp. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Escalation is open to any caller; downgrade and refresh are restricted to the current owner; a no-match directive set from the owner's scope clears the entry. Non-owner downgrades and anonymous (scope=None) clears are silently ignored. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
clear_priority drops both sidecar and heap state for one block; clear drops everything; __contains__ reports whether a block is currently in the heap. Locks in the sidecar lifecycle invariants. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Add the queue as a member of BlockPool and clear it on reset_prefix_cache. Two single-line hooks; no changes to KVCacheBlock. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Reads retention_directives and retention_scope from extra_args and forwards them to the priority eviction queue. Returns early when neither is present, preserving the existing zero-overhead path. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Fast path is bit-for-bit equivalent to LRU when the priority queue is empty (single integer compare). Otherwise: drain LRU first, then pop lowest-priority blocks from the queue. get_num_free_blocks now sums both pools. reset_prefix_cache updated to use free_block_queue.num_free_blocks directly (pre-Task-13 guard against double-counting with sidecar). Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
…llocation The earlier priority-aware get_num_free_blocks rewrite (sum of LRU and priority queue) required a reset_prefix_cache change to keep an earlier eviction-queue test green. That side-effect is reverted here; the test is instead adjusted to remove the block from the LRU before adding it to the priority queue, matching the single-queue invariant enforced later when touch() is updated. Keeps reset_prefix_cache correct in production after full integration. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Single 3-line dispatch in touch(): if the reused block is in the priority queue, remove from there; otherwise remove from the LRU free list (existing behavior). Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Single-line filter: blocks for which try_insert returns False fall through to the LRU free list (existing path). try_insert stamps the current monotonic time so the heap tiebreak reflects this most-recent free. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Single-line hook in the evict_blocks per-block loop. Ensures the sidecar dict tracks only blocks that are currently in the prefix cache. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Two new Pydantic fields on ChatCompletionRequest, forwarded into SamplingParams.extra_args. Pure addition — no existing-line changes. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Across rising token positions, priorities must be non-increasing (prefix-cache constraint: cached prefixes are shared across requests, so later token spans cannot retain blocks that earlier spans do not). Validator runs before model construction; sorts by start, then checks. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Two regression tests prevent KVCacheBlock or Request from accidentally gaining retention-specific fields in future commits. They make the sidecar contract self-enforcing. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Adapt the existing benchmark to read priority via the queue API rather than KVCacheBlock fields. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
…ock generation counter PriorityEvictionQueue holds freed blocks in a min-heap keyed by (priority, last_freed_time). A remove()+re-free reuse cycle pushed a second tuple for the same block_id without removing the prior one, so the heap could carry stale tuples with an outdated (priority, last_freed_time). pop_lowest could then evict using a stale tuple — e.g. a block re-protected at priority 90 evicted as if still priority 50 — inverting the intended eviction order and silently violating retention directives. Fix: stamp every pushed tuple with a per-block monotonic generation (_gen[block_id], bumped on each try_insert) and have pop_lowest skip any popped tuple whose generation no longer matches the block's current generation (lazy deletion). Eviction now always orders by the block's CURRENT (priority, last_freed_time). Bundled (entangled in the same files): - drain_expired(): demote expired sidecar entries to the LRU free list rather than letting them top the next pop_lowest (was destroying prefix-cache hit rate under TTL'd protection). - try_insert routes below-threshold sidecar entries to the LRU instead of the priority queue. Tests: stale-tuple/inversion + threshold-routing cases in test_priority_eviction.py; conftest lowers _PRIORITY_THRESHOLD to 0 for the dir's priority=50 convention. Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
This PR adds a minimal KV-cache eviction prioritization API to vLLM's chat completion endpoint, to support multi-turn agentic workloads where downstream orchestrators (NVIDIA Dynamo, llm-d) need explicit per-request retention hints.
The directive shape mirrors TensorRT-LLM's
KvCacheRetentionConfig.TokenRangeRetentionConfig— same primitives: token range[start, end),priority(0-100), optionalduration.A single orchestrator can target both backends with the same payload semantics; only unit conversion (ms ↔ seconds) and one extra field (our
retention_scope) differ.The Dynamo blog uses example syntax matching this directly:
Implements RFC vllm-project#37003 ([RFC]: Context-Aware KV-Cache Retention API (Prioritized Evictions)).
This supersedes the existing draft PR vllm-project#38514 with the same public API, restructured to keep
KVCacheBlockandRequestcompletely untouched.PR vllm-project#38514 will be closed once this lands; RFC vllm-project#37003 remains open for ongoing design discussion.
Public API
Two optional fields on
ChatCompletionRequest:Forwarded to
SamplingParams.extra_argsand consumed byblock_poolat the cache-full transition.Validator enforces that priorities are non-increasing across rising token positions (prefix-cache constraint).
Behavior
get_new_blocksandget_num_free_blocksare bit-for-bit equivalent to the existing LRU path — a single integer comparison guards the branch.Anonymous (
scope=None) callers cannot clear another scope's directive.durationtranslates into a monotonic expiry timestamp; expired entries are silently treated as unprioritized.What's changed
The change is purely additive at the API surface and at the data model:
KVCacheBlockis not modified (0 changes tokv_cache_utils.py)Requestis not modified (0 changes torequest.py)vllm/v1/core/priority_eviction_queue.pyblock_pool.pyintegrates the new module via 8 single-line hook calls (+71 lines total)Two structural-invariant regression tests are included so the "no core changes" property is automatically enforced against future commits.
Compared to vllm-project#38514:
block_pool.pyinlineKVCacheBlockchangesRequestchangesReferences
KvCacheRetentionConfigAPI: https://nvidia.github.io/TensorRT-LLM/latest/features/kvcache.htmlTest Plan
Test Result
tests/v1/core/test_priority_eviction.py— 40/40 passTestPriorityEvictionQueue(10): heap operations, lazy delete, TTLTestApplyDirectives(11): overlap matching, ownership rulesTestSidecarLifecycle(6): cleanup invariantsTestBlockPoolPriorityEviction(7): integration testsTestStructuralInvariants(2):KVCacheBlock/Requestuntouchedtests/tool_use/test_chat_completion_request_validations.py— 10/10 pass (3 pre-existing + 7 new for retention)benchmarks/benchmark_retention_eviction.py— end-to-end PASS (prioritized blocks survive; unprioritized blocks evicted)pre-commit run --files <all touched>— cleanpre-commit run mypy-3.10 --hook-stage manual— cleanpytest tests/v1/core/— 305 pass; remaining 32 fail / 2 error reproduce identically on the PR base (verified in a separate worktree).Unrelated to this PR —
test_scheduler.py/test_scheduler_e2e.pyEC-connector and multimodal tests that depend on a model artifact not available in the CI environment.Performance benchmark numbers will be added separately (re-running to ensure reproducibility).
Essential Elements of an Effective PR Description Checklist
Performance numbers are being re-measured and will be added in a follow-up.