LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards
AI Summary
LongR improves LLM performance on long-context reasoning through reinforcement learning with dense rewards.
Key Contributions
- Proposes the LongR framework, which incorporates a dynamic "Think-and-Read" mechanism.
- Introduces a contextual density reward based on relative information gain.
- Achieves significant improvements on LongBench v2, RULER, and InfiniteBench.
Methodology
LongR applies reinforcement learning that dynamically interleaves thinking and reading over the context, using a dense reward function to guide the model's learning.
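The paper does not spell out the reward formula here, but a "contextual density reward based on relative information gain" can be sketched as scoring each retrieved document by how much it raises the log-likelihood of the correct answer, then normalizing the gains into dense per-document rewards. Everything below (the function names, the normalization scheme, and the toy log-probability values) is a hypothetical illustration, not LongR's actual implementation:

```python
def contextual_density_reward(log_p_with, log_p_without):
    """Relative information gain of one document: how much including it
    raises the log-likelihood of the gold answer. Inputs are
    log-probabilities from some answer scorer (hypothetical here)."""
    return log_p_with - log_p_without

def normalized_density_rewards(gains):
    """Normalize per-document gains into dense rewards summing to 1.
    Documents with zero or negative gain (irrelevant or distracting)
    receive zero reward."""
    positive = [max(g, 0.0) for g in gains]
    total = sum(positive)
    if total == 0.0:
        return [0.0] * len(gains)
    return [g / total for g in positive]

# Toy example: log-likelihood of the gold answer with vs. without each doc.
gains = [contextual_density_reward(-1.2, -3.0),  # useful doc: gain 1.8
         contextual_density_reward(-1.2, -1.2),  # irrelevant doc: gain 0.0
         contextual_density_reward(-1.2, -0.9)]  # distractor: negative gain
print(normalized_density_rewards(gains))  # -> [1.0, 0.0, 0.0]
```

Under this sketch, distractor documents are never rewarded, which would be consistent with the robustness-to-distractors analysis mentioned in the abstract.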
Original Abstract
Reinforcement Learning has emerged as a key driver for LLM reasoning. This capability is equally pivotal in long-context scenarios, such as long-dialogue understanding and structured data analysis, where the challenge extends beyond consuming tokens to performing rigorous deduction. While existing efforts focus on data synthesis or architectural changes, recent work points out that relying solely on sparse, outcome-only rewards yields limited gains, as such coarse signals are often insufficient to effectively guide complex long-context reasoning. To address this, we propose LongR, a unified framework that enhances long-context performance by integrating a dynamic "Think-and-Read" mechanism, which interleaves reasoning with document consultation, with a contextual density reward based on relative information gain to quantify the utility of the relevant documents. Empirically, LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, demonstrating robust efficiency in navigating extensive contexts. Furthermore, LongR consistently enhances performance across diverse RL algorithms (e.g., DAPO, GSPO). Finally, we conduct in-depth analyses to investigate the impact of reasoning chain length on efficiency and the model's robustness against distractors.