IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models
AI Summary
Proposes an attention-based token pruning method for Large Vision Language Models (LVLMs), aiming to improve efficiency and reduce computational cost.
Main Contributions
- Casts token pruning as implicit weight pruning
- Proposes a token selection metric based on information magnitude and information redundancy
- Introduces the Progressive Chunked Maximal Marginal Relevance algorithm
Methodology
By reformulating the attention mechanism as an implicit linear layer, a token importance metric is derived, and token selection is then performed with the proposed algorithm.
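The "implicit linear layer" view is easiest to see for unnormalized (linear) attention: the output for a query equals the query multiplied by a weight matrix that is the sum of rank-1 outer products, one per token's key-value pair. The sketch below illustrates only that identity with toy NumPy data; all variable names are illustrative, and the paper's extension to softmax attention is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                       # toy token count and head dimension
K = rng.standard_normal((n, d))   # keys, one row per visual token
V = rng.standard_normal((n, d))   # values
q = rng.standard_normal(d)        # a single query

# Dual-form weight matrix: sum of rank-1 updates k_i v_i^T
W = sum(np.outer(K[i], V[i]) for i in range(n))  # shape (d, d)

# Unnormalized linear attention: o = sum_i (q . k_i) v_i
o_attn = (q @ K.T) @ V
# The same output through the implicit linear layer
o_dual = q @ W

assert np.allclose(o_attn, o_dual)
```

Under this identity, dropping a token removes exactly one rank-1 term from `W`, which is why pruning tokens can be read as implicitly pruning weights.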
Original Abstract
Large Vision Language Models show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods mitigate this issue through empirical approaches while overlooking the internal mechanism of attention. In this paper, we propose a novel training-free token pruning framework grounded in the dual-form perspective of attention. We reformulate attention as an implicit linear layer whose weight matrix is the sum of rank-1 outer products, each generated by a single token's key-value pair. Token pruning thus reduces to selecting an optimal subset of these rank-1 updates that best approximates the original dual weight matrix. Extending this perspective to standard softmax attention in LVLMs, we derive a novel metric quantifying both a token's information magnitude and information duplication. To efficiently select the subset with the proposed metric, we introduce Progressive Chunked Maximal Marginal Relevance. Extensive experiments demonstrate that our method achieves a better trade-off between performance and efficiency, while providing another perspective on existing pruning approaches.
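The abstract's selection step balances a token's importance score against its redundancy with already-kept tokens, processed chunk by chunk. The sketch below is a generic chunked greedy MMR, not the paper's exact algorithm: the importance scores, the trade-off weight `lam`, the `chunk` size, and the per-chunk quota are all assumptions made for illustration.

```python
import numpy as np

def chunked_mmr(feats, scores, keep_ratio=0.25, lam=0.7, chunk=4):
    """Greedy Maximal Marginal Relevance over token features, applied
    progressively in chunks (a hedged sketch, not the paper's method).

    feats:  (n, d) per-token features used to measure redundancy
    scores: (n,)   per-token importance (information magnitude)
    """
    n = len(feats)
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    selected = []
    for start in range(0, n, chunk):
        cand = list(range(start, min(start + chunk, n)))
        quota = max(1, round(len(cand) * keep_ratio))
        for _ in range(quota):
            best, best_val = None, -np.inf
            for i in cand:
                # Redundancy: highest cosine similarity to any kept token
                red = max((unit[i] @ unit[j] for j in selected), default=0.0)
                val = lam * scores[i] - (1 - lam) * red
                if val > best_val:
                    best, best_val = i, val
            selected.append(best)
            cand.remove(best)
    return sorted(selected)

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))
scores = rng.random(8)
kept = chunked_mmr(feats, scores, keep_ratio=0.25, chunk=4)
```

Chunking keeps the greedy loop cheap: each step compares only the current chunk's candidates against the running selection instead of scoring all tokens at once.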