MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference

MPCache has been accepted to NeurIPS '25. The codebase is built on the LongBench framework.

Abstract

Private LLM inference based on multi-party computation (MPC) offers cryptographically secure protection for both user prompts and proprietary model weights. However, it suffers from large latency overhead for long input sequences. While key-value (KV) cache eviction algorithms have been proposed to reduce the computation and memory cost of plaintext inference, they are not designed for MPC and may even introduce more overhead. In this paper, we propose an accurate and MPC-friendly KV cache eviction framework, dubbed MPCache. MPCache is built on the observation that historical tokens in a long sequence may have different effects on downstream decoding. Hence, MPCache combines a look-once static eviction algorithm that discards unimportant tokens with a query-aware dynamic selection algorithm that further chooses a small subset of tokens for attention computation. As existing dynamic selection algorithms incur too much latency, we propose a series of optimizations to drastically reduce the KV cache selection overhead, including MPC-friendly similarity approximation, hierarchical KV cache clustering, and a layer-wise index-sharing strategy.

Dataset Preparation

You can download and load the LongBench dataset with the Hugging Face datasets library (HF Repo):

from datasets import load_dataset

# LongBench task names.
datasets = ["narrativeqa", "qasper", "multifieldqa_en", "multifieldqa_zh", "hotpotqa", "2wikimqa", "musique",
            "dureader", "gov_report", "qmsum", "multi_news", "vcsum", "trec", "triviaqa", "samsum", "lsht",
            "passage_count", "passage_retrieval_en", "passage_retrieval_zh", "lcc", "repobench-p"]

# Load the test split of every task, keyed by task name, so that
# earlier tasks are not overwritten by later ones.
data = {}
for dataset in datasets:
    data[dataset] = load_dataset('THUDM/LongBench', dataset, split='test')

You can also download the datasets directly from this link.

Evaluation

Packages and environment: we use Python 3.10. Install the required packages with pip:

pip install -r requirements.txt

Note that the exact package versions matter. For LLM inference on long contexts, we use FlashAttention during the prefill stage to save GPU memory. Install its dependencies by following the instructions in the FlashAttention codebase.

Dataset choice:

To evaluate a specific dataset, modify the following line in pred_mine.py (hotpotqa is used here as an example):

datasets = ["hotpotqa"]  # choose the dataset

Model file and configuration:

The core code of KV cache eviction is implemented in llama_flash_attn_monkey_patch_compression.py.

For hierarchical clustering, alpha controls the ratio between $\mathbf r^{\min}$ and $\mathbf r^{\max}$ ($\alpha$ is set to 0.6 by default). cluster_size1 and cluster_size2 control the granularities of the two hierarchy levels (cluster_size1=32 and cluster_size2=16 by default). ratio1 controls the dynamic selection ratio at the 1st hierarchy level, and ratio2 controls the overall dynamic selection ratio.
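To make the interaction of these parameters concrete, here is a small arithmetic sketch. It is not taken from the repository: the function name `hierarchical_budget`, the rounding rules, and the example values of ratio1 and ratio2 are all assumptions, and it reads ratio2 as a fraction of the full token count, which is only one plausible interpretation of "overall dynamic selection ratio".

```python
def hierarchical_budget(num_tokens, ratio1, ratio2,
                        cluster_size1=32, cluster_size2=16):
    """Hypothetical helper: how many clusters survive at each level.

    Assumes level-1 clusters are split evenly into level-2 clusters and
    that ratio2 is measured against the full token count.
    """
    if cluster_size1 % cluster_size2 != 0:
        raise ValueError("cluster_size1 must be a multiple of cluster_size2")
    if not 0 < ratio2 <= ratio1 <= 1:
        raise ValueError("expected 0 < ratio2 <= ratio1 <= 1")

    # Level 1: coarse clusters of cluster_size1 tokens each.
    num_coarse = num_tokens // cluster_size1
    keep_coarse = max(1, int(num_coarse * ratio1))

    # Level 2: each kept coarse cluster is split into finer clusters.
    num_fine = keep_coarse * (cluster_size1 // cluster_size2)
    keep_fine = max(1, min(num_fine, int(num_tokens * ratio2) // cluster_size2))

    return {
        "coarse_clusters": num_coarse,
        "coarse_kept": keep_coarse,
        "fine_candidates": num_fine,
        "fine_kept": keep_fine,
        "tokens_kept": keep_fine * cluster_size2,
    }


budget = hierarchical_budget(4096, ratio1=0.5, ratio2=0.25)
print(budget)
```

Under these assumptions, only the fine clusters inside the surviving coarse clusters need to be scored at the second level, which is where a hierarchy can save similarity computations.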

Helper utilities are implemented in utils.py. For example, group_key_min_max computes $\mathbf r^{\min}$ and $\mathbf r^{\max}$ for each cluster, and groupidx_to_tokenidx converts the selected group indices into the corresponding token indices.
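The two helpers described above can be illustrated with a minimal NumPy sketch. This is not the code in utils.py: the function names `cluster_min_max` and `cluster_idx_to_token_idx`, their signatures, and the assumption that clusters are contiguous, fixed-size runs of tokens are all hypothetical and chosen only for illustration.

```python
import numpy as np


def cluster_min_max(keys, cluster_size):
    """Element-wise min and max of the keys in each contiguous cluster.

    keys: array of shape (num_tokens, head_dim).
    Returns (r_min, r_max), each of shape (num_clusters, head_dim).
    """
    num_tokens, head_dim = keys.shape
    num_clusters = -(-num_tokens // cluster_size)  # ceiling division
    pad = num_clusters * cluster_size - num_tokens
    # Repeat the last key to fill the final cluster; duplicating a key
    # that already belongs to that cluster leaves its min/max unchanged.
    padded = np.pad(keys, ((0, pad), (0, 0)), mode="edge")
    grouped = padded.reshape(num_clusters, cluster_size, head_dim)
    return grouped.min(axis=1), grouped.max(axis=1)


def cluster_idx_to_token_idx(cluster_idx, cluster_size, num_tokens):
    """Expand selected cluster indices into the token indices they cover."""
    cluster_idx = np.asarray(cluster_idx)
    offsets = np.arange(cluster_size)
    token_idx = (cluster_idx[:, None] * cluster_size + offsets[None, :]).reshape(-1)
    # Drop indices that fall into the padding of the last cluster.
    return token_idx[token_idx < num_tokens]


keys = np.arange(10, dtype=float).reshape(5, 2)  # 5 tokens, head_dim = 2
r_min, r_max = cluster_min_max(keys, cluster_size=2)
tokens = cluster_idx_to_token_idx([0, 2], cluster_size=2, num_tokens=5)
```

Here the five tokens form three clusters (the last one holding a single real token), and selecting clusters 0 and 2 maps back to tokens 0, 1, and 4.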

The core algorithm of similarity approximation is shown below:

[Figures: similarity approximation algorithm]
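As a rough illustration of the idea, the sketch below scores each cluster with a single dot product against a blended key $\alpha \mathbf r^{\max} + (1-\alpha)\mathbf r^{\min}$ and then keeps the top-scoring clusters. This is an assumption about how alpha enters the computation, not a transcription of the repository code; the function names `approx_cluster_scores` and `select_top_clusters` and the rounding of the selection ratio are hypothetical. The motivation for a linear blend is that it avoids the element-wise comparisons of a max-based bound, and comparisons are generally expensive under MPC.

```python
import numpy as np


def approx_cluster_scores(query, r_min, r_max, alpha=0.6):
    """Approximate query-cluster similarity without any comparisons.

    query: (head_dim,); r_min, r_max: (num_clusters, head_dim).
    Assumption: alpha weights r_max and (1 - alpha) weights r_min.
    """
    representative = alpha * r_max + (1.0 - alpha) * r_min
    return representative @ query


def select_top_clusters(scores, ratio):
    """Indices of the highest-scoring clusters, in ascending index order."""
    k = max(1, int(round(ratio * scores.shape[0])))
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top)


query = np.array([1.0, 0.0])
r_min = np.array([[0.0, 0.0], [2.0, 0.0]])
r_max = np.array([[1.0, 0.0], [4.0, 0.0]])
scores = approx_cluster_scores(query, r_min, r_max, alpha=0.5)
selected = select_top_clusters(scores, ratio=0.5)
```

The selected cluster indices would then be expanded to token indices, and attention would be computed only over those tokens.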

Model inference and evaluation:

First, run pred_mine.py to perform model inference with longchat-v1.5-7b-32k:

CUDA_VISIBLE_DEVICES=0 python pred_mine.py --model longchat-v1.5-7b-32k

You can also run inference on multiple GPUs in parallel (one model per GPU):

CUDA_VISIBLE_DEVICES=0,1,2,3 python pred_mine.py --model longchat-v1.5-7b-32k

The inference outputs for each dataset are then written to the pred_mine/ folder, in the subfolder corresponding to the model name.

After inference, run eval_mine.py to evaluate the model performance (no GPU needed):

python eval_mine.py --model longchat-v1.5-7b-32k

The results for each dataset are written to result.json.

For MPC inference, we use SecretFlow with SPU to evaluate efficiency; the code can be found in infer/.

Reference

@article{zeng2025mpcache,
  title={MPCache: MPC-Friendly KV Cache Eviction for Efficient Private Large Language Model Inference},
  author={Zeng, Wenxuan and Dong, Ye and Zhou, Jinjin and Ma, Junming and Tan, Jin and Wang, Runsheng and Li, Meng},
  journal={arXiv preprint arXiv:2501.06807},
  year={2025}
}
