VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization
AI Summary
VQKV achieves a high compression ratio and high fidelity for the KV cache via vector quantization, significantly improving the deployability of LLMs in resource-constrained environments.
Key Contributions
- Proposes VQKV, a new KV cache compression method based on vector quantization
- Maintains high model performance while achieving a high compression ratio
- Enables longer generation lengths and reduces memory footprint
Methodology
Vector quantization (VQ) compresses a large number of floating-point values into a small number of integer indices, achieving both a high compression ratio and high fidelity for the KV cache.
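The idea behind this float-to-index compression can be sketched in Python with NumPy. The codebook construction below (plain k-means), the data shapes, and the 8-bit index size are illustrative assumptions for a toy example, not the paper's actual procedure:

```python
import numpy as np

def build_codebook(vectors, k, iters=10, seed=0):
    # Learn a codebook of k codewords with simple Lloyd's k-means
    # (hypothetical stand-in; the summary does not specify how VQKV
    # builds its codebook).
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest codeword, then recenter.
        dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = vectors[assign == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook

def vq_encode(vectors, codebook):
    # Replace each d-dimensional float vector with one small integer index.
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1).astype(np.uint8)

def vq_decode(codes, codebook):
    # Reconstruction is a plain codebook lookup.
    return codebook[codes]

# Toy KV-like data: 1024 "key" vectors of dimension 64 (made-up sizes).
rng = np.random.default_rng(1)
kv = rng.standard_normal((1024, 64)).astype(np.float32)

codebook = build_codebook(kv, k=256)
codes = vq_encode(kv, codebook)
recon = vq_decode(codes, codebook)

# Storage: 1024*64 float32 values vs 1024 uint8 indices
# plus the shared 256*64 float32 codebook.
orig_bytes = kv.nbytes
comp_bytes = codes.nbytes + codebook.nbytes
print(f"compression ratio: {1 - comp_bytes / orig_bytes:.1%}")
```

Even in this toy setting, each 64-float vector collapses to a single byte once the codebook is amortized, which is the mechanism that lets VQ reach much higher ratios than scalar quantization at comparable fidelity.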
Original Abstract
The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel, training-free method introducing vector quantization (VQ) to obtain highly compressed KV representations while preserving high model fidelity, allowing for the representation of thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench and enabling 4.3x longer generation length on the same memory footprint.