LLM Memory & RAG relevance: 9/10

VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization

Yixuan Wang, Qingyu Shi, Jiayu Zhou, Dianbo Liu, Ziwei He, Zhouhan Lin
arXiv: 2603.16435v1 Published: 2026-03-17 Updated: 2026-03-17

AI Summary

VQKV uses vector quantization to achieve both a high compression ratio and high fidelity for the KV cache, significantly improving the deployability of LLMs in resource-constrained environments.

Key Contributions

  • Proposes VQKV, a new KV cache compression method based on vector quantization
  • Maintains high model performance while achieving a high compression ratio
  • Enables longer generation lengths while reducing memory usage

Methodology

Vector quantization (VQ) compresses large numbers of floating-point values into a small set of integer indices, achieving both a high compression ratio and high reconstruction fidelity for the KV cache.
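The idea above can be sketched in a few lines of NumPy. This is an illustrative toy example, not the paper's actual VQKV procedure: the tensor sizes are hypothetical, and the codebook here is random, whereas a real system would learn it (e.g. via k-means over KV sub-vectors). It shows how each floating-point vector is replaced by a single integer index into a shared codebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 256 KV sub-vectors of dimension 8, codebook of 16 codewords.
kv = rng.standard_normal((256, 8)).astype(np.float32)
codebook = rng.standard_normal((16, 8)).astype(np.float32)  # random stand-in; real codebooks are learned

def vq_encode(x, codebook):
    # Map each vector to its nearest codeword (Euclidean distance);
    # store only one small integer index per vector.
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1).astype(np.uint8)

def vq_decode(indices, codebook):
    # Lossy reconstruction: look each index up in the shared codebook.
    return codebook[indices]

idx = vq_encode(kv, codebook)       # 256 uint8 indices instead of 256x8 float32s
kv_hat = vq_decode(idx, codebook)   # approximate reconstruction of the cache

orig_bytes = kv.nbytes                       # 256 * 8 * 4 = 8192 bytes
comp_bytes = idx.nbytes + codebook.nbytes    # indices + shared codebook
print(f"compression ratio: {1 - comp_bytes / orig_bytes:.1%}")
```

Even with the codebook counted against the budget, storing one byte per sub-vector cuts memory by an order of magnitude in this toy setup; the fidelity then depends entirely on how well the codebook covers the distribution of KV vectors.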

Original Abstract

The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel, training-free method introducing vector quantization (VQ) to obtain highly compressed KV representations while preserving high model fidelity, allowing for the representation of thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench and enabling 4.3x longer generation length on the same memory footprint.

Tags

KV cache compression, vector quantization, large language models, low-resource deployment

arXiv Categories

cs.CL