LLM Reasoning Relevance: 9/10

DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning

Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang, Xin Wang, Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, Peizhou Huang, Mi Zhang
arXiv: 2602.19895v1 Published: 2026-02-23 Updated: 2026-02-23

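AI Summary

Proposes DSDR, a framework that uses dual-scale diversity regularization to strengthen reinforcement-learning-based exploration in LLM reasoning and improve reasoning performance.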

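Main Contributions

  • Proposes the Dual-Scale Diversity Regularization (DSDR) framework
  • Designs global and local diversity components that encourage exploration of distinct reasoning modes
  • Proposes a global-to-local allocation mechanism that couples the two scales and strengthens learning signals
  • Provides theoretical support showing that DSDR preserves optimal correctness under bounded regularization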

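Methodology

Within a reinforcement learning framework, a global diversity term promotes exploration of distinct correct reasoning trajectories, a local term applies length-invariant, token-level entropy regularization to correct trajectories to prevent entropy collapse, and a global-to-local allocation mechanism couples the two by emphasizing local regularization for the more distinctive correct trajectories.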
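The coupling of the two scales can be illustrated with a minimal Python sketch. This summary does not give the paper's actual diversity measure or allocation rule, so everything concrete here is an assumption made for illustration: Jaccard distance over token sets stands in for the global diversity measure, a softmax over distinctiveness scores stands in for the allocation rule, and all function names are hypothetical.

```python
import math


def jaccard_distance(a, b):
    """1 - |A n B| / |A u B| over token sets (assumed stand-in for a diversity measure)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def distinctiveness(trajectories, correct):
    """Global scale: mean distance from each correct trajectory to the other correct ones."""
    idx = [i for i, ok in enumerate(correct) if ok]
    scores = {}
    for i in idx:
        others = [j for j in idx if j != i]
        if others:
            total = sum(jaccard_distance(trajectories[i], trajectories[j]) for j in others)
            scores[i] = total / len(others)
        else:
            scores[i] = 0.0
    return scores


def allocation_weights(scores, temperature=1.0):
    """Global-to-local coupling (assumed form): softmax over distinctiveness scores."""
    if not scores:
        return {}
    top = max(scores.values())
    exps = {i: math.exp((s - top) / temperature) for i, s in scores.items()}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def local_entropy_bonus(token_entropies, weights):
    """Local scale: weighted sum of per-trajectory MEAN token entropy.

    Averaging over tokens makes the term independent of trajectory length.
    Only trajectories that received a weight (the correct ones) contribute.
    """
    bonus = 0.0
    for i, w in weights.items():
        ents = token_entropies[i]
        if ents:
            bonus += w * sum(ents) / len(ents)
    return bonus
```

In this sketch, incorrect trajectories never receive a weight, so they contribute nothing to the entropy bonus, and a correct trajectory that differs more from its correct peers receives a larger share of the local regularization.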

Original Abstract

Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and local components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group-based optimization, and yields a principled global-to-local coupling rule. Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR. Code is available at https://github.com/SUSTechBruce/DSDR.
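For readers unfamiliar with the pass@k metric mentioned in the abstract, the sketch below computes the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The abstract does not say which estimator the authors use, so treating this one as their evaluation protocol is an assumption; the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: total samples drawn, c: number of correct samples, k: evaluation budget.
    Returns the probability that at least one of k samples, drawn without
    replacement from the n, is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is then the mean of `pass_at_k` over all problems. Because pass@k rewards having at least one correct sample among k, it is sensitive to the diversity of sampled solutions, which is why it is a natural companion to plain accuracy here.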

Tags

LLM Reasoning Reinforcement Learning Diversity Regularization

arXiv Categories

cs.LG cs.CL