MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference

MPCache has been accepted to NeurIPS '25. The codebase is built on the LongBench framework.

Abstract

Private LLM inference based on multi-party computation (MPC) offers cryptographically secure protection for both user prompts and proprietary model weights. However, it suffers from large latency overhead for long input sequences. While key-value (KV) cache eviction algorithms have been proposed to reduce the computation and memory cost of plaintext inference, they are not designed for MPC and may even introduce more overhead. In this paper, we propose an accurate and MPC-friendly KV cache eviction framework, dubbed MPCache. MPCache is built on the observation that historical tokens in a long sequence may have different effects on downstream decoding. Hence, MPCache combines a look-once static eviction algorithm, which discards unimportant tokens, with a query-aware dynamic selection algorithm, which further chooses a small subset of tokens for attention computation. As existing dynamic selection algorithms incur too much latency, we propose a series of optimizations to drastically reduce the KV cache selection overhead, including MPC-friendly similarity approximation, hierarchical KV cache clustering, and a layer-wise index sharing strategy.

Dataset Preparation

You can download and load the LongBench dataset with the Hugging Face datasets library (HF Repo):

from datasets import load_dataset

datasets = ["narrativeqa", "qasper", "multifieldqa_en", "multifieldqa_zh", "hotpotqa", "2wikimqa", "musique",
            "dureader", "gov_report", "qmsum", "multi_news", "vcsum", "trec", "triviaqa", "samsum", "lsht",
            "passage_count", "passage_retrieval_en", "passage_retrieval_zh", "lcc", "repobench-p"]

for dataset in datasets:
    data = load_dataset('THUDM/LongBench', dataset, split='test')

You can also download the datasets directly from this link.

Evaluation

Packages and environment: Python 3.10 is used. Install the required packages with pip:

pip install -r requirements.txt

Package versions matter, so use the versions listed in requirements.txt. For LLM inference on long contexts, we use the FlashAttention optimization during the prefill stage to save GPU memory. Install the relevant dependencies by following the FlashAttention codebase.

Dataset choice:

To evaluate a specific dataset, modify the following line in pred_mine.py (hotpotqa is used as an example):

datasets = ["hotpotqa"]  # choose the dataset

Model file and configuration:

The core code of KV cache eviction is implemented in llama_flash_attn_monkey_patch_compression.py.

For hierarchical clustering, alpha controls the ratio between $\mathbf r^{\min}$ and $\mathbf r^{\max}$ ($\alpha$ is set to 0.6 by default). cluster_size1 and cluster_size2 control the granularities of the two hierarchy levels (cluster_size1=32 and cluster_size2=16 by default). ratio1 controls the dynamic selection ratio at the first hierarchy level, and ratio2 controls the overall dynamic selection ratio.
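To make the interaction of these parameters concrete, the following sketch counts how many clusters and tokens survive each level. It is an illustration under assumptions, not the repository's code: the function name hierarchical_budget, the rounding rules, and the example values ratio1=0.5 and ratio2=0.25 are hypothetical; only the cluster sizes come from the defaults above.

```python
import math


def hierarchical_budget(num_tokens, cluster_size1=32, cluster_size2=16,
                        ratio1=0.5, ratio2=0.25):
    """Count clusters and tokens kept at each hierarchy level.

    Illustrative only: ratio1/ratio2 values and rounding are assumptions.
    """
    # Level 1: coarse clusters over all tokens left after static eviction.
    num_clusters1 = math.ceil(num_tokens / cluster_size1)
    kept_clusters1 = max(1, math.floor(ratio1 * num_clusters1))
    tokens_level1 = min(num_tokens, kept_clusters1 * cluster_size1)

    # Level 2: re-split the surviving tokens into finer clusters.
    num_clusters2 = math.ceil(tokens_level1 / cluster_size2)
    # ratio2 is the overall ratio, so the target is relative to num_tokens.
    target_tokens = ratio2 * num_tokens
    kept_clusters2 = max(1, min(num_clusters2,
                                math.floor(target_tokens / cluster_size2)))
    tokens_level2 = min(tokens_level1, kept_clusters2 * cluster_size2)

    return {
        "clusters_level1": num_clusters1,
        "kept_level1": kept_clusters1,
        "tokens_level1": tokens_level1,
        "clusters_level2": num_clusters2,
        "kept_level2": kept_clusters2,
        "tokens_level2": tokens_level2,
    }


budget = hierarchical_budget(4096)
```

Under these assumed ratios, a 4096-token cache is scored as 128 coarse clusters first, and only the 128 fine clusters inside the 64 surviving coarse clusters are scored at the second level, instead of all 256 fine clusters.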

Helper functions are implemented in utils.py. For example, group_key_min_max computes $\mathbf r^{\min}$ and $\mathbf r^{\max}$ for each cluster, and groupidx_to_tokenidx converts the selected group indices into the corresponding token indices.
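The behavior of these two helpers can be sketched in plain Python. This is a simplified stand-in, not the code in utils.py: the `_sketch` function names, the list-based inputs, and the assumption that clusters are runs of consecutive tokens are all hypothetical, and the real functions presumably operate on batched tensors.

```python
def group_key_min_max_sketch(keys, cluster_size):
    """Compute element-wise min and max key vectors per cluster.

    keys: list of per-token key vectors (each a list of floats).
    Clusters are assumed to be consecutive runs of `cluster_size` tokens.
    """
    r_min, r_max = [], []
    for start in range(0, len(keys), cluster_size):
        group = keys[start:start + cluster_size]
        # zip(*group) yields one tuple per feature dimension.
        r_min.append([min(dim) for dim in zip(*group)])
        r_max.append([max(dim) for dim in zip(*group)])
    return r_min, r_max


def groupidx_to_tokenidx_sketch(group_indices, cluster_size, num_tokens):
    """Expand selected cluster indices into the token indices they cover."""
    token_indices = []
    for g in group_indices:
        start = g * cluster_size
        # The last cluster may be shorter than cluster_size.
        end = min(start + cluster_size, num_tokens)
        token_indices.extend(range(start, end))
    return token_indices


keys = [[1.0, -2.0], [3.0, 0.5], [-1.0, 4.0], [2.0, 2.0], [0.0, 0.0]]
r_min, r_max = group_key_min_max_sketch(keys, cluster_size=2)
tokens = groupidx_to_tokenidx_sketch([0, 2], cluster_size=2, num_tokens=5)
```

Keeping only two vectors per cluster means the selection step scores clusters rather than individual tokens, which is what makes the later index expansion necessary.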

The core algorithm of similarity approximation is shown below:

[Figure: core algorithm of similarity approximation]
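As a rough illustration of this step, here is a minimal sketch that scores each cluster against the current query and keeps the top-scoring clusters. It rests on an assumption: alpha is read as a linear interpolation weight between $\mathbf r^{\max}$ and $\mathbf r^{\min}$, which replaces a per-dimension maximum (comparisons are expensive under MPC) with multiplications and additions. The function names and list-based inputs are hypothetical, and this is not a transcription of the repository's implementation.

```python
def approx_similarity(query, r_min, r_max, alpha=0.6):
    """Score each cluster with a comparison-free surrogate (assumed form).

    Each cluster is represented by alpha * r_max + (1 - alpha) * r_min,
    and its score is the dot product of that vector with the query.
    """
    scores = []
    for lo, hi in zip(r_min, r_max):
        rep = [alpha * h + (1.0 - alpha) * l for l, h in zip(lo, hi)]
        scores.append(sum(q * r for q, r in zip(query, rep)))
    return scores


def select_top_clusters(scores, k):
    """Return the indices of the k highest-scoring clusters, in ascending order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


query = [1.0, 2.0]
r_min = [[0.0, 0.0], [1.0, 1.0]]
r_max = [[1.0, 1.0], [2.0, 3.0]]
scores = approx_similarity(query, r_min, r_max, alpha=0.5)
selected = select_top_clusters(scores, k=1)
```

In this toy example the second cluster has larger key values in both dimensions, so it receives the higher score and is the one selected.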

Model inference and evaluation:

First, run pred_mine.py to perform the model inference on longchat-v1.5-7b-32k:

CUDA_VISIBLE_DEVICES=0 python pred_mine.py --model longchat-v1.5-7b-32k

You can also run inference on multiple GPUs in parallel (one model per GPU):

CUDA_VISIBLE_DEVICES=0,1,2,3 python pred_mine.py --model longchat-v1.5-7b-32k

The model's outputs for each dataset are saved under pred_mine/, in a subfolder named after the model.

After inference, run eval_mine.py to evaluate the model performance (no GPU required):

python eval_mine.py --model longchat-v1.5-7b-32k

The results for each dataset are written to result.json.

For MPC inference, we evaluate efficiency using SecretFlow with SPU; the corresponding code is in infer/.

Reference

@article{zeng2025mpcache,
  title={MPCache: MPC-Friendly KV Cache Eviction for Efficient Private Large Language Model Inference},
  author={Zeng, Wenxuan and Dong, Ye and Zhou, Jinjin and Ma, Junming and Tan, Jin and Wang, Runsheng and Li, Meng},
  journal={arXiv preprint arXiv:2501.06807},
  year={2025}
}

