LLM Memory & RAG relevance: 9/10

VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization

Yixuan Wang, Qingyu Shi, Jiayu Zhou, Dianbo Liu, Ziwei He, Zhouhan Lin
arXiv: 2603.16435v1 Published: 2026-03-17 Updated: 2026-03-17

AI Summary

VQKV uses vector quantization to achieve both a high compression ratio and high fidelity for the KV cache, significantly improving the deployability of LLMs in resource-constrained environments.

Key Contributions

  • Proposes VQKV, a new KV cache compression method based on vector quantization
  • Maintains high model performance while achieving a high compression ratio
  • Enables longer generation lengths while reducing memory usage

Methodology

Vector quantization (VQ) compresses large numbers of floating-point values into a small set of integer indices, achieving both a high compression ratio and high reconstruction fidelity for the KV cache.
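The idea above can be sketched in a few lines of NumPy. This is an illustrative toy example, not the paper's actual VQKV procedure: the tensor sizes are hypothetical, and the codebook here is random, whereas a real system would learn it (e.g. via k-means over KV sub-vectors). It shows how each floating-point vector is replaced by a single integer index into a shared codebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 256 KV sub-vectors of dimension 8, codebook of 16 codewords.
kv = rng.standard_normal((256, 8)).astype(np.float32)
codebook = rng.standard_normal((16, 8)).astype(np.float32)  # random stand-in; real codebooks are learned

def vq_encode(x, codebook):
    # Map each vector to its nearest codeword (Euclidean distance);
    # store only one small integer index per vector.
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1).astype(np.uint8)

def vq_decode(indices, codebook):
    # Lossy reconstruction: look each index up in the shared codebook.
    return codebook[indices]

idx = vq_encode(kv, codebook)       # 256 uint8 indices instead of 256x8 float32s
kv_hat = vq_decode(idx, codebook)   # approximate reconstruction of the cache

orig_bytes = kv.nbytes                       # 256 * 8 * 4 = 8192 bytes
comp_bytes = idx.nbytes + codebook.nbytes    # indices + shared codebook
print(f"compression ratio: {1 - comp_bytes / orig_bytes:.1%}")
```

Even with the codebook counted against the budget, storing one byte per sub-vector cuts memory by an order of magnitude in this toy setup; the fidelity then depends entirely on how well the codebook covers the distribution of KV vectors.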

Original Abstract

The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel, training-free method introducing vector quantization (VQ) to obtain highly compressed KV representations while preserving high model fidelity, allowing for the representation of thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench and enabling 4.3x longer generation length on the same memory footprint.

Tags

KV cache compression, vector quantization, large language models, low-resource deployment

arXiv Categories

cs.CL