Multimodal Learning Relevance: 9/10

ApET: Approximation-Error Guided Token Compression for Efficient VLMs

Qiankun Ma, Ziyao Zhang, Haofei Wang, Jie Chen, Zhen Song, Hairong Zheng
arXiv: 2602.19870v1 Published: 2026-02-23 Updated: 2026-02-23

AI Summary

ApET uses approximation-error-guided token compression to substantially improve VLM inference efficiency while preserving performance.

Main Contributions

  • Proposes ApET, a visual-token compression framework guided by approximation error
  • Requires no attention scores, so it remains compatible with efficient attention kernels such as FlashAttention
  • Validates ApET's effectiveness across multiple VLMs and benchmarks

Methodology

Reconstructs the visual tokens via linear approximation from a small set of basis tokens, then uses the approximation error to identify and drop low-information tokens, thereby compressing the token sequence.
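The pipeline above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the basis-selection strategy (here, random sampling) and the L2 residual as the error metric are assumptions, since the digest does not specify them.

```python
import numpy as np

def approx_error_compress(tokens, num_basis, keep_ratio, seed=0):
    """Sketch of approximation-error-guided token compression.

    tokens: (N, d) array of visual token embeddings.
    Returns the kept subset of tokens, in original order.
    """
    N, d = tokens.shape
    rng = np.random.default_rng(seed)

    # Assumed basis selection: random subset of tokens (the paper's
    # actual selection scheme may differ).
    basis_idx = rng.choice(N, size=num_basis, replace=False)
    B = tokens[basis_idx]  # (k, d)

    # Linear approximation: find weights W so that tokens ≈ W @ B,
    # i.e. solve the least-squares system B.T @ W = tokens.T.
    W, *_ = np.linalg.lstsq(B.T, tokens.T, rcond=None)  # W: (k, N)
    recon = (B.T @ W).T  # (N, d) reconstructed tokens

    # Per-token approximation error; tokens the basis reconstructs
    # well (low error) carry little extra information and are dropped.
    err = np.linalg.norm(tokens - recon, axis=1)

    keep = int(N * keep_ratio)
    keep_idx = np.sort(np.argsort(err)[-keep:])  # keep highest-error tokens
    return tokens[keep_idx]
```

Note that, by construction, the basis tokens themselves have near-zero residual and are dropped from the output; in a real system one would likely retain the basis set alongside the high-error tokens.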

Original Abstract

Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet their redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically rely on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions tend to introduce positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an Approximation-Error guided Token compression framework. ApET first reconstructs the original visual tokens with a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens. Extensive experiments across multiple VLMs and benchmarks demonstrate that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks, while compressing the token budgets by 88.9% and 87.5%, respectively. Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference acceleration and making VLM deployment more practical. Code is available at https://github.com/MaQianKun0/ApET.

Tags

VLM · Token Compression · Attention-free · Approximation Error · Inference Acceleration

arXiv Category

cs.CV