LLM Reasoning relevance: 9/10

On the Learning Dynamics of RLVR at the Edge of Competence

Yu Huang, Zixin Wen, Yuejie Chi, Yuting Wei, Aarti Singh, Yingbin Liang, Yuxin Chen
arXiv: 2602.14872v1  Published: 2026-02-16  Updated: 2026-02-16

AI Summary

The paper studies the training dynamics of RLVR on compositional reasoning tasks and reveals how the data difficulty spectrum shapes learning.

Key Contributions

  • Develops a theory of the training dynamics of RLVR for transformers
  • Shows how the smoothness of the data difficulty spectrum governs RLVR performance
  • Analyzes the learning process with tools from Fourier analysis on finite groups (see the definition sketched after this list)
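
For context, the last bullet refers to the standard Fourier transform on a finite group; the paper's exact conventions and the specific group underlying its compositional tasks are not given in this summary, so the following is only the textbook definition. For a finite group G, a function f : G → C, and the irreducible unitary representations ρ of G with dimensions d_ρ:

\hat{f}(\rho) = \sum_{g \in G} f(g)\, \rho(g),
\qquad
f(g) = \frac{1}{|G|} \sum_{\rho} d_\rho \, \operatorname{tr}\!\big( \hat{f}(\rho)\, \rho(g^{-1}) \big).

When G is abelian, every d_ρ equals 1 and this reduces to the familiar expansion of f in the characters of G.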

Methodology

Through theoretical analysis and synthetic experiments, the paper validates the relationship between the smoothness of the data difficulty spectrum and RLVR performance.

Original Abstract

Reinforcement learning with verifiable rewards (RLVR) has been a main driver of recent breakthroughs in large reasoning models. Yet it remains a mystery how rewards based solely on final outcomes can help overcome the long-horizon barrier to extended reasoning. To understand this, we develop a theory of the training dynamics of RL for transformers on compositional reasoning tasks. Our theory characterizes how the effectiveness of RLVR is governed by the smoothness of the difficulty spectrum. When data contains abrupt discontinuities in difficulty, learning undergoes grokking-type phase transitions, producing prolonged plateaus before progress recurs. In contrast, a smooth difficulty spectrum leads to a relay effect: persistent gradient signals on easier problems elevate the model's capabilities to the point where harder ones become tractable, resulting in steady and continuous improvement. Our theory explains how RLVR can improve performance at the edge of competence, and suggests that appropriately designed data mixtures can yield scalable gains. As a technical contribution, our analysis develops and adapts tools from Fourier analysis on finite groups to our setting. We validate the predicted mechanisms empirically via synthetic experiments.
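
The following is a minimal numerical caricature of the plateau-versus-relay contrast described above, not the paper's model or experiments: a scalar "skill" s is trained with an outcome-only reward on problems of difficulty d, the success probability is assumed to be sigmoid(k·(s − d)), and the expected policy-gradient signal is proportional to k·p·(1 − p), i.e. concentrated at the edge of competence. All constants and the sigmoid competence model are illustrative assumptions.

# Toy sketch (illustrative assumptions only, not the paper's construction).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(difficulties, steps=1200, lr=0.5, k=2.0, s0=0.0):
    """Run expected-gradient updates; return the average success rate over time."""
    s, acc = s0, []
    for _ in range(steps):
        p = sigmoid(k * (s - difficulties))   # success probability per problem
        s += lr * np.mean(k * p * (1.0 - p))  # expected outcome-reward gradient on s
        acc.append(float(np.mean(p)))         # average accuracy on the mixture
    return acc

rng = np.random.default_rng(0)
smooth = rng.uniform(0.0, 6.0, size=2000)                      # smooth difficulty spectrum
gapped = np.concatenate([np.zeros(1000), np.full(1000, 6.0)])  # abrupt difficulty gap

acc_smooth, acc_gapped = train(smooth), train(gapped)
for t in (0, 100, 300, 500, 700, 1000):
    print(f"step {t:4d}   smooth acc {acc_smooth[t]:.2f}   gapped acc {acc_gapped[t]:.2f}")

In this toy setting the gapped mixture stalls near 50% accuracy for several hundred steps before jumping, while the smooth mixture improves steadily, echoing the plateau and relay behaviors the abstract describes.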

Tags

Reinforcement Learning  Transformer  Reasoning  Learning Dynamics  Fourier Analysis

arXiv Categories

cs.LG cs.AI math.OC stat.ML