Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity
AI Summary
PruneSID compresses visual tokens in VLMs efficiently by synergistically combining importance and diversity, improving inference speed.
Main Contributions
- Proposes PruneSID, a training-free visual token compression method
- Designs Principal Semantic Components Analysis (PSCA) for token clustering
- Introduces Intra-group Non-Maximum Suppression (NMS) to select key representative tokens
- Proposes an information-aware dynamic compression ratio mechanism
Methodology
A two-stage pipeline: PSCA clusters visual tokens into semantically coherent groups, intra-group NMS keeps representative tokens from each group, and an information-aware dynamic compression ratio adapts the token budget to image complexity.
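The paper's exact PSCA formulation and importance scoring are not reproduced in this summary, so the following is only a minimal PyTorch sketch of the two-stage idea: it assumes PCA over token features for grouping and cosine-similarity suppression within each group. The function names (`psca_group`, `intra_group_nms`) and the similarity threshold are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a two-stage importance-diversity pipeline.
# Assumptions: visual token features of shape (N, D) and a per-token
# importance score (e.g. an attention weight) are already available.
import torch


def psca_group(tokens: torch.Tensor, num_components: int = 8) -> torch.Tensor:
    """Assign each token to the principal component it projects onto most."""
    centered = tokens - tokens.mean(dim=0, keepdim=True)
    # Top principal directions of the token feature matrix, shape (D, k).
    _, _, v = torch.pca_lowrank(centered, q=num_components)
    projections = centered @ v              # (N, k) coordinates
    return projections.abs().argmax(dim=1)  # group id per token


def intra_group_nms(tokens, scores, groups, sim_thresh=0.8):
    """Within each group, keep high-score tokens and suppress near-duplicates."""
    normed = torch.nn.functional.normalize(tokens, dim=1)
    keep = []
    for g in groups.unique():
        idx = (groups == g).nonzero(as_tuple=True)[0]
        order = idx[scores[idx].argsort(descending=True)]
        selected = []
        for i in order:
            # Drop token i if it is too similar to an already-kept token.
            if all(normed[i] @ normed[j] < sim_thresh for j in selected):
                selected.append(i)
        keep.extend(selected)
    return torch.stack(keep)


# Toy usage: 576 visual tokens of dim 1024 with random importance scores.
feats = torch.randn(576, 1024)
importance = torch.rand(576)
group_ids = psca_group(feats)
kept = intra_group_nms(feats, importance, group_ids)
compressed = feats[kept]  # retained visual tokens fed to the LLM
```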
Original Abstract
Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose PruneSID, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, PruneSID incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more effective average information preservation across diverse scenes. Extensive experiments demonstrate state-of-the-art performance, achieving 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention, and 92.8% accuracy at extreme compression rates (5.6%) on LLaVA-NeXT, outperforming prior methods by 2.5% with 7.8 $\times$ faster prefilling speed compared to the original model. Our framework generalizes across diverse VLMs and both image and video modalities, showcasing strong cross-modal versatility. Code is available at https://github.com/ZhengyaoFang/PruneSID.
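The abstract describes an information-aware dynamic compression ratio driven by image complexity, but the precise complexity measure is not given here. The sketch below is a hedged illustration that assumes the entropy of a per-token importance distribution as a stand-in complexity signal and maps it linearly between two retention rates; the bounds simply reuse the 5.6% and 11.1% retention figures quoted above and are not the paper's actual schedule.

```python
# Hedged sketch of an information-aware dynamic retention ratio:
# flatter (higher-entropy) importance distributions are treated as more
# complex images and are allowed to keep more tokens.
import torch


def dynamic_retention_ratio(scores, r_min=0.056, r_max=0.111):
    """Map importance-score entropy to a per-image token retention ratio."""
    probs = torch.softmax(scores, dim=0)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    max_entropy = torch.log(torch.tensor(float(scores.numel())))
    complexity = (entropy / max_entropy).clamp(0.0, 1.0)  # in [0, 1]
    return r_min + complexity.item() * (r_max - r_min)


ratio = dynamic_retention_ratio(torch.rand(576))
num_keep = int(576 * ratio)  # per-image token budget
```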