LLM Memory & RAG relevance: 9/10

Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

Ziyao Tang, Pengkun Jiao, Xinhang Chen, Wei Liu, Shiyong Li, Jingjing Chen
arXiv: 2602.08585v1  Published: 2026-02-09  Updated: 2026-02-09

AI Summary

LU-KV optimizes head-level cache allocation to shrink the KV cache, reducing both inference latency and GPU memory usage.

Key Contributions

  • Proposes the LU-KV framework, which optimizes head-level cache allocation via a convex-hull relaxation and a marginal-utility greedy solver
  • Introduces a data-driven offline profiling protocol to ease practical deployment of LU-KV
  • Validates the effectiveness of LU-KV on the LongBench and RULER benchmarks
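The convex-hull relaxation mentioned above can be illustrated with a minimal sketch: given an offline-profiled utility curve for one head (utility as a function of its token budget), taking the upper concave envelope makes marginal gains non-increasing, which is what lets a greedy allocator be near-optimal. The function name `concave_envelope` and the curve format are assumptions for illustration, not the paper's actual interface.

```python
def concave_envelope(curve):
    """Upper concave envelope of a per-head utility curve.

    curve[b] is the (hypothetical, offline-profiled) utility of giving
    this head a budget of b+1 tokens. Returns the envelope as a list of
    (budget, utility) points via a monotone-chain upper-hull scan; along
    the envelope, marginal gains are non-increasing.
    """
    # anchor the curve at (budget 0, utility 0)
    pts = [(0, 0.0)] + [(b + 1, u) for b, u in enumerate(curve)]
    hull = []
    for x, y in pts:
        # pop the last point while it lies on or below the new hull edge
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append((x, y))
    return hull
```

For a curve that is already concave the envelope keeps every point; a point sitting below the hull (a budget step with anomalously low utility) is dropped, smoothing the curve before allocation.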

Methodology

Head-level budget allocation is optimized with a convex-hull relaxation and a greedy algorithm, guided by the marginal utility of preserving long-term semantic information.
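The greedy step can be sketched as follows: with concave per-head utility curves (e.g. after the convex-hull relaxation), repeatedly give the next unit of budget to the head with the highest marginal utility. This is a minimal illustration under assumed inputs, not the paper's implementation; `allocate_budgets` and the `utility` dictionary format are hypothetical.

```python
import heapq

def allocate_budgets(utility, total_budget):
    """Greedy head-level budget allocation by marginal utility.

    utility: dict mapping head id -> list of cumulative utilities, where
    utility[h][b] is the profiled utility of giving head h a budget of
    b+1 cached tokens. Curves are assumed concave, so greedily taking
    the largest marginal gain each step is near-optimal.
    """
    alloc = {h: 0 for h in utility}
    # max-heap (negated gains) of the marginal utility of each head's next unit
    heap = []
    for h, curve in utility.items():
        if curve:
            heapq.heappush(heap, (-curve[0], h))
    for _ in range(total_budget):
        if not heap:
            break  # every head already holds its full curve
        neg_gain, h = heapq.heappop(heap)
        alloc[h] += 1
        b = alloc[h]
        curve = utility[h]
        if b < len(curve):
            # re-insert the head with the marginal gain of its next unit
            heapq.heappush(heap, (-(curve[b] - curve[b - 1]), h))
    return alloc
```

With curves `{'A': [10, 12, 13], 'B': [5, 9, 12]}` and a total budget of 3, the allocator gives A one token (gain 10) and B two (gains 5 then 4), since A's second token is worth only 2.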

Original Abstract

Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods typically rely on instantaneous heuristic metrics, implicitly assuming that score magnitudes are consistent proxies for importance across all heads. However, this overlooks the heterogeneity in predictive fidelity across attention heads. While certain heads prioritize the instantaneous contribution of tokens, others are dedicated to capturing long-horizon utility. In this paper, we propose that optimal budget allocation should be governed by the marginal utility in preserving long-term semantic information. Based on this insight, we propose LU-KV, a novel framework that optimizes head-level budget allocation through a convex-hull relaxation and a marginal-utility-based greedy solver to achieve near-optimal precision. Furthermore, we implement a data-driven offline profiling protocol to facilitate the practical deployment of LU-KV. Extensive evaluations on LongBench and RULER benchmarks demonstrate that LU-KV achieves an 80% reduction in KV cache size with minimal performance degradation, while simultaneously reducing inference latency and GPU memory footprint.

Tags

KV Cache · Attention · Inference Optimization · Memory Management

arXiv Categories

cs.LG cs.AI