LLM Reasoning relevance: 7/10

The Key to State Reduction in Linear Attention: A Rank-based Perspective

Philipp Nazari, T. Konstantin Rusch
arXiv: 2602.04852v1 · Published: 2026-02-04 · Updated: 2026-02-04

AI Summary

Analyzes the low-rank phenomenon in trained linear attention models and proposes a hardware-aware structured pruning method that reduces the model's state size.

Main Contributions

  • Theoretically analyzes how the rank of the state affects retrieval error in linear attention
  • Proposes a structured pruning method based on a rank-revealing QR decomposition to reduce state size
  • Shows that pruned models suffer only minimal performance degradation while remaining compatible with existing CUDA kernels
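The low-rank phenomenon the paper analyzes can be made concrete with a small numerical experiment. The sketch below (illustrative only, not the paper's code; all names and dimensions are made up) accumulates a linear attention state as a sum of key-value outer products and measures its effective rank via the singular value spectrum:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, T = 64, 64, 512

# Simulate keys that live mostly in an 8-dimensional subspace, mimicking
# the empirically observed low-rank structure of trained models.
basis = rng.standard_normal((d_k, 8))
keys = rng.standard_normal((T, 8)) @ basis.T + 0.01 * rng.standard_normal((T, d_k))
values = rng.standard_normal((T, d_v))

# Linear attention recurrent state: S = sum_t k_t v_t^T.
state = keys.T @ values  # shape (d_k, d_v)

# Effective rank: number of singular values above a relative threshold.
s = np.linalg.svd(state, compute_uv=False)
eff_rank = int(np.sum(s > 1e-2 * s[0]))
print(eff_rank)  # far below the full state dimension of 64
```

An effective rank far below `d_k` means the state underuses its capacity, which is what motivates pruning it post-training.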

Methodology

A theoretical analysis reveals how low effective rank amplifies query noise in retrieval; building on this, a structured pruning method based on a rank-revealing QR decomposition is proposed and validated experimentally.
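A minimal sketch of the rank-revealing idea, assuming a simple setup (the matrix names and shapes are hypothetical, not taken from the paper's code): a column-pivoted QR factorization of the key projection ranks channels by how much new directional information each contributes, and the weakest channels are then pruned from both the key and query projections so their inner products stay aligned:

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(1)
d_model, d_k = 128, 64

# Hypothetical key and query projection weights.
W_k = rng.standard_normal((d_model, d_k))
W_q = rng.standard_normal((d_model, d_k))

# Column-pivoted (rank-revealing) QR on W_k: pivots order the key
# channels from most to least informative.
_, _, pivots = qr(W_k, pivoting=True)

# Keep the 50% most informative channels, matching the paper's
# reported pruning ratio.
keep = np.sort(pivots[: d_k // 2])

# Prune the same channels from keys and queries; the recurrent state
# loses the corresponding rows, shrinking memory and compute.
W_k_pruned = W_k[:, keep]
W_q_pruned = W_q[:, keep]
print(W_k_pruned.shape)  # (128, 32)
```

Because pruning removes whole channels (columns) rather than introducing sparsity patterns, the resulting smaller dense matrices remain compatible with existing CUDA kernels.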

Original Abstract

Linear attention offers a computationally efficient yet expressive alternative to softmax attention. However, recent empirical results indicate that the state of trained linear attention models often exhibits a low-rank structure, suggesting that these models underexploit their capacity in practice. To illuminate this phenomenon, we provide a theoretical analysis of the role of rank in linear attention, revealing that low effective rank can affect retrieval error by amplifying query noise. In addition to these theoretical insights, we conjecture that the low-rank states can be substantially reduced post-training with only minimal performance degradation, yielding faster and more memory-efficient models. To this end, we propose a novel hardware-aware approach that structurally prunes key and query matrices, reducing the state size while retaining compatibility with existing CUDA kernels. We adapt several existing pruning strategies to fit our framework and, building on our theoretical analysis, propose a novel structured pruning method based on a rank-revealing QR decomposition. Our empirical results, evaluated across models of varying sizes and on various downstream tasks, demonstrate the effectiveness of our state reduction framework. We highlight that our framework enables the removal of 50% of the query and key channels at only a marginal increase in perplexity. The code for this project can be found at https://github.com/camail-official/LinearAttentionPruning.

Tags

Linear Attention · Model Pruning · Low-Rank Approximation

arXiv Categories

cs.LG