ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning
AI Summary
ATTNPO leverages the model's attention mechanism for process supervision, effectively reducing reasoning redundancy while improving performance.
Key Contributions
- Proposes ATTNPO, a low-overhead process-supervised reinforcement learning framework
- Leverages the model's intrinsic attention signals for step-level credit assignment
- Mitigates overthinking by discouraging redundant steps while reducing penalties on essential steps
Methodology
A set of special attention heads is identified whose scores distinguish essential from redundant reasoning steps; these scores are then used to shape the reward function.
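The step-level credit assignment described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the attention-tensor shape, the choice of special heads, the step spans, and the reward-shaping formula (a per-step penalty scaled down for important steps) are all assumptions made for the example.

```python
import numpy as np

def step_importance(attn, special_heads, step_spans):
    """Score each reasoning step by how much attention it receives
    from a set of special heads (hypothetical interface).

    attn: array of shape [num_heads, seq_len, seq_len]
    special_heads: indices of heads that focus on essential steps
    step_spans: list of (start, end) token spans, one per step
    Returns per-step importance scores normalized to [0, 1].
    """
    scores = []
    for start, end in step_spans:
        # Average attention mass received by this step's tokens
        # across the selected heads.
        scores.append(attn[special_heads, :, start:end].mean())
    s = np.array(scores)
    # Min-max normalize so scores are comparable across trajectories.
    return (s - s.min()) / (s.max() - s.min() + 1e-8)

def shaped_step_rewards(base_reward, importance, length_penalty=0.1):
    """Discourage redundant steps (low importance) with a length
    penalty, while shrinking the penalty on essential steps
    (high importance) to preserve accuracy."""
    return base_reward - length_penalty * (1.0 - importance)
```

In this sketch, a step that the special heads attend to strongly keeps nearly the full base reward, while steps they suppress are penalized, which is the qualitative behavior the summary attributes to ATTNPO's two sub-strategies.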
Original Abstract
Large reasoning models trained with reinforcement learning and verifiable rewards (RLVR) achieve strong performance on complex reasoning tasks, yet often overthink, generating redundant reasoning without performance gains. Existing trajectory-level length penalties often fail to shorten reasoning effectively and can degrade accuracy, as they treat all reasoning steps uniformly and lack fine-grained signals to distinguish redundancy from necessity. Meanwhile, process-supervised methods are typically resource-intensive and suffer from inaccurate credit assignment. To address these issues, we propose ATTNPO, a low-overhead process-supervised RL framework that leverages the model's intrinsic attention signals for step-level credit assignment. We first identify a set of special attention heads that naturally focus on essential steps while suppressing redundant ones. Leveraging the attention scores of these heads, we then employ two sub-strategies: discouraging redundant steps to mitigate overthinking, and reducing penalties on essential steps to preserve accuracy. Experimental results show that ATTNPO substantially reduces reasoning length while significantly improving performance across 9 benchmarks.