LLM Reasoning relevance: 9/10

LAD: Learning Advantage Distribution for Reasoning

Wendi Li, Sharon Li
arXiv: 2602.20132v1 Published: 2026-02-23 Updated: 2026-02-23

AI Summary

LAD addresses overfitting to dominant reward signals in LLM reasoning by learning the advantage distribution, improving both reasoning accuracy and generative diversity.

Key Contributions

  • Proposes the Learning Advantage Distributions (LAD) framework
  • Establishes the equivalence between the optimal policy update and an advantage-based target distribution
  • Demonstrates experimentally that LAD improves both accuracy and generative diversity on math and code reasoning

Methodology

The policy-induced distribution is matched to the advantage-induced distribution by minimizing an f-divergence, which prevents overfitting and improves generative diversity.
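A minimal sketch of the distribution-matching idea, not the paper's exact objective: group-normalized advantages (GRPO-style) define a softmax target distribution over a group of sampled responses, and the loss is a KL divergence, one member of the f-divergence family, between that target and the policy's distribution over the same group. The function names, the softmax construction of the target, and the choice of forward KL are assumptions for illustration.

```python
import numpy as np

def advantage_target(rewards):
    # GRPO-style group-normalized advantages (assumed form; the paper's
    # exact normalization may differ)
    a = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Advantage-induced target distribution: softmax over the group,
    # so multiple high-advantage responses all retain probability mass
    t = np.exp(a - a.max())
    return t / t.sum()

def lad_loss(logprobs, rewards):
    # Policy-induced distribution over the sampled group
    p = np.exp(logprobs - np.max(logprobs))
    p = p / p.sum()
    q = advantage_target(np.asarray(rewards, dtype=float))
    # Forward KL(q || p): an f-divergence. Minimizing it pulls the policy
    # toward the full advantage distribution rather than collapsing onto
    # the single best response, which is the claimed diversity benefit.
    return float(np.sum(q * (np.log(q + 1e-12) - np.log(p + 1e-12))))
```

Because the target spreads mass across all high-advantage responses, a policy that concentrates on one of them still incurs loss, which mirrors the paper's claim that LAD suppresses over-confident probability growth without auxiliary entropy regularization.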

Original Abstract

Current reinforcement learning objectives for large-model reasoning primarily focus on maximizing expected rewards. This paradigm can lead to overfitting to dominant reward signals, while neglecting alternative yet valid reasoning trajectories, thereby limiting diversity and exploration. To address this issue, we introduce Learning Advantage Distributions (LAD), a distribution-matching framework that replaces advantage maximization with learning the advantage-induced distribution. By establishing the equivalence between the optimal policy update and an advantage-based target distribution, we derive a practical LAD objective formulated as minimizing an $f$-divergence between the policy-induced and advantage-induced distributions. This yields a gradient update that increases likelihood for high-advantage responses while suppressing over-confident probability growth, preventing collapse without requiring auxiliary entropy regularization. LAD incurs no extra training cost compared to GRPO and scales naturally to LLM post-training. In a controlled bandit setting, LAD faithfully recovers the multimodal advantage distribution, validating the theoretical formulation. Experiments on math and code reasoning tasks across several LLM backbones show that LAD reliably improves both accuracy and generative diversity.

Tags

reinforcement learning · reasoning · large language models · advantage distribution

arXiv Categories

cs.LG