Adaptive Loops and Memory in Transformers: Think Harder or Know More?
AI Summary
This paper studies the roles of looped transformers and memory modules in improving the reasoning ability of language models, as well as the effect of combining them.
Main Contributions
- Proposes a transformer model that combines adaptive looping with memory modules
- Finds that looping mainly improves mathematical reasoning, while memory modules improve commonsense reasoning
- Analyzes the model's internals, revealing functional specialization across layers
Methodology
Builds a transformer with an adaptive looping mechanism and gated memory banks, then evaluates and analyzes it on mathematical and commonsense reasoning tasks.
Original Abstract
Chain-of-thought (CoT) prompting enables reasoning in language models but requires explicit verbalization of intermediate steps. Looped transformers offer an alternative by iteratively refining representations within hidden states. This parameter efficiency comes at a cost, as looped models lack the storage capacity of deeper models that use unique weights per layer. In this work, we investigate transformer models that feature both adaptive per-layer looping, where each transformer block learns to iterate its hidden state via a learned halting mechanism, and gated memory banks that provide additional learned storage. We find that looping primarily benefits mathematical reasoning, while memory banks help recover performance on commonsense tasks compared to parameter- and FLOP-matched models. Combining both mechanisms yields a model that outperforms an iso-FLOP baseline -- with three times the number of layers -- on math benchmarks. Analysis of model internals reveals layer specialization: early layers learn to loop minimally and access memory sparingly, while later layers do both more heavily.
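The two mechanisms described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the parameter shapes, the ACT-style halting rule, and the attention-based memory read are all assumptions chosen for clarity, using NumPy on a single hidden vector instead of a trained transformer block.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_mem = 8, 4  # hidden size and number of memory slots (illustrative)

# Hypothetical parameters; in the actual model these would be learned.
W_block = rng.standard_normal((d, d)) * 0.1   # stand-in for one transformer block
w_halt  = rng.standard_normal(d) * 0.1        # halting head for adaptive looping
M       = rng.standard_normal((n_mem, d))     # learned memory bank
w_gate  = rng.standard_normal(d) * 0.1        # gate controlling the memory read

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_block(h, max_loops=4, eps=0.01):
    """Adaptive per-layer looping: iterate the block on its own hidden
    state until the cumulative halting probability exceeds 1 - eps
    (an Adaptive Computation Time-style rule, assumed here)."""
    cum_p, out, remainder = 0.0, np.zeros_like(h), 1.0
    for step in range(max_loops):
        h = np.tanh(h @ W_block)              # one refinement iteration
        p = float(sigmoid(h @ w_halt))        # halting probability for this step
        if cum_p + p >= 1.0 - eps or step == max_loops - 1:
            out += remainder * h              # spend the remaining probability mass
            return out, step + 1
        out += p * h
        cum_p += p
        remainder -= p

def gated_memory_read(h):
    """Gated memory bank: attend over learned memory slots, then let a
    sigmoid gate decide how much of the retrieved vector to mix in."""
    scores = M @ h / np.sqrt(d)
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                        # softmax over memory slots
    read = attn @ M                           # weighted read from the bank
    g = sigmoid(h @ w_gate)                   # gate in [0, 1]
    return h + g * read

h = rng.standard_normal(d)
h, n_loops = adaptive_block(h)                # loop count is input-dependent
h = gated_memory_read(h)
print(n_loops, h.shape)
```

The sketch makes the layer-specialization finding concrete: a layer that learns a small halting probability will take more loop iterations, and a layer whose gate saturates near zero effectively ignores the memory bank.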