Learning to Forget Attention: Memory Consolidation for Adaptive Compute Reduction
AI Summary
Proposes an adaptive compute-reduction method based on memory consolidation, improving efficiency by dynamically cutting redundant attention computation.
Key Contributions
- Finds that the bulk of attention operations in GPT-2 models are redundant, retrieving information already predictable from the hidden state, and proposes CRMA to address this.
- Introduces CRMA, a consolidation-based routing mechanism whose attention utilization decreases over the course of training.
- Proves that without a consolidation mechanism, any static routing scheme requires $\Omega(f \cdot n)$ attention (a hedged sketch of the argument follows this list).
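The lower-bound claim can be unpacked slightly. What follows is a hedged reconstruction from the abstract alone, under an assumed setup (patterns recurring with frequency $f$ across a length-$n$ sequence); the paper's exact theorem statement and constants are not reproduced in this summary.

```latex
% Hedged restatement of the lower bound (assumed setup; not the paper's
% verbatim theorem). A static routing scheme fixes which positions use
% attention before training, so it cannot amortize repeated retrievals:
% each of the roughly f * n recurrences of a pattern must be resolved by
% an episodic attention lookup, giving
\[
  \mathrm{AttentionOps}_{\mathrm{static}}(n) \;=\; \Omega(f \cdot n),
\]
% whereas a consolidating router pays the retrieval cost only until the
% pattern has been distilled into parametric semantic memory.
```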
Methodology
Proposes a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory, reducing redundant attention computation.
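To make this concrete, below is a minimal PyTorch sketch of what consolidation-based routing could look like. It is an illustration under stated assumptions, not CRMA's actual implementation: the class name `ConsolidationRouter`, the MSE distillation term, and the utilization penalty weight are all hypothetical.

```python
# A minimal sketch of consolidation-based routing (hypothetical; not the
# paper's code). Each token is routed between an episodic attention path
# and a parametric "semantic memory" path; a distillation loss gradually
# transfers what attention retrieves into the parametric weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsolidationRouter(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Parametric semantic memory: answers from weights, no retrieval.
        self.semantic_mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.gate = nn.Linear(d_model, 1)  # per-token routing logit

    def forward(self, x: torch.Tensor):
        # p -> 1: use episodic attention; p -> 0: use semantic memory.
        p = torch.sigmoid(self.gate(x))                        # (B, T, 1)
        attn_out, _ = self.attn(x, x, x, need_weights=False)   # (B, T, D)
        sem_out = self.semantic_mlp(x)                         # (B, T, D)
        out = p * attn_out + (1.0 - p) * sem_out
        # Distillation pulls the parametric path toward the (detached)
        # attention output; the utilization penalty nudges the gate shut
        # once the two paths agree, so attention use can fall over training.
        distill = F.mse_loss(sem_out, attn_out.detach())
        aux_loss = distill + 0.01 * p.mean()  # 0.01 is an assumed weight
        return out, aux_loss

# Usage: add aux_loss to the task loss and track p.mean() over training;
# under this design it should decrease as recurring patterns are absorbed.
router = ConsolidationRouter(d_model=64)
out, aux = router(torch.randn(2, 16, 64))
```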
Original Abstract
Hybrid architectures combining state-space models with attention have achieved strong efficiency-quality tradeoffs, yet existing approaches either apply attention uniformly or learn static sparse patterns. This misses a key opportunity: *attention demand should decrease over time as recurring patterns become familiar*. We present a surprising finding from analyzing GPT-2 models: **88%** of attention operations retrieve information already predictable from the model's hidden state, and this redundancy does *not* decrease during training. Motivated by this observation, we introduce **CRMA** (**C**onsolidation-based **R**outing for **A**daptive **M**emory), a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory. Unlike prior sparse attention methods, CRMA exhibits *decreasing attention utilization* over training, achieving a **37.8×** reduction through a sharp phase transition at approximately 3K steps. We prove that this capability is *impossible* without consolidation: any static routing scheme requires $\Omega(f \cdot n)$ attention for tasks with recurring patterns of frequency $f$. On our proposed SRCD benchmark, CRMA achieves **100% retrieval accuracy** at 1.6% attention compute (vs. 68% for baselines), and consolidated patterns transfer to unseen tasks with **48–52%** attention reduction without retraining. Remarkably, the learned consolidation dynamics quantitatively match human episodic-to-semantic memory transition curves from cognitive psychology ($\gamma = 0.43$ vs. $\gamma_{\text{human}} \approx 0.4$–$0.5$). Code and benchmarks are available at [anonymized].
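As a final illustration, the exponent $\gamma$ in the abstract can be read as the slope of a power-law decay. The snippet below assumes the episodic (attention) fraction decays as $E(t) \propto t^{-\gamma}$ and fits $\gamma$ by log-log regression; this functional form and the fitting procedure are assumptions, since the paper's exact curve definition is not given in this summary.

```python
# Hedged sketch: estimating a consolidation exponent gamma, assuming the
# attention fraction follows E(t) ~ t**(-gamma) (assumed form, not the
# paper's stated fitting procedure).
import numpy as np

def fit_gamma(steps: np.ndarray, attn_fraction: np.ndarray) -> float:
    """Least-squares slope in log-log space: attn_fraction ~ steps**(-gamma)."""
    slope, _intercept = np.polyfit(np.log(steps), np.log(attn_fraction), 1)
    return -slope

# Synthetic check: recover gamma = 0.43 from noisy measurements.
steps = np.arange(100, 10_000, 100, dtype=float)
frac = steps ** -0.43 * np.exp(np.random.normal(0.0, 0.02, steps.shape))
print(f"estimated gamma ~ {fit_gamma(steps, frac):.2f}")
```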