LLM Reasoning relevance: 9/10

MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning

Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Binbin Zheng, Chaowen Hu, Zekai Shao, Cong Qin, Lu Pan, Ke Zeng, Xunliang Cai
arXiv: 2602.17550v1 Published: 2026-02-19 Updated: 2026-02-19

AI Summary

MASPO improves the robustness and sample efficiency of LLM reasoning by unifying gradient utilization, probability mass, and signal reliability.

Key Contributions

  • Proposes the MASPO framework, which unifies gradient utilization, probability mass, and signal reliability.
  • Introduces a differentiable soft Gaussian gating mechanism to maximize gradient utility.
  • Designs a mass-adaptive limiter to balance exploration across the probability spectrum.
  • Adopts an asymmetric risk controller to align update magnitudes with signal confidence.

Methodology

MASPO optimizes policy updates in reinforcement learning through soft gating, a mass-adaptive limiter, and asymmetric risk control, improving LLM reasoning performance.

Original Abstract

Existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms, such as GRPO, rely on rigid, uniform, and symmetric trust region mechanisms that are fundamentally misaligned with the complex optimization dynamics of Large Language Models (LLMs). In this paper, we identify three critical challenges in these methods: (1) inefficient gradient utilization caused by the binary cutoff of hard clipping, (2) insensitive probability mass arising from uniform ratio constraints that ignore the token distribution, and (3) asymmetric signal reliability stemming from the disparate credit assignment ambiguity between positive and negative samples. To bridge these gaps, we propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive limiter to balance exploration across the probability spectrum, and an asymmetric risk controller to align update magnitudes with signal confidence. Extensive evaluations demonstrate that MASPO serves as a robust, all-in-one RLVR solution, significantly outperforming strong baselines. Our code is available at: https://anonymous.4open.science/r/ma1/README.md.
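To make the first challenge concrete, the sketch below contrasts PPO/GRPO-style hard clipping, whose gradient vanishes once the importance ratio leaves the trust region, with a differentiable Gaussian gate of the kind the abstract describes. This is a minimal illustration only: the function names, the gate form `exp(-(r-1)^2 / 2σ^2)`, and the parameter values are assumptions, not MASPO's actual formulation.

```python
import numpy as np

def hard_clip_weight(ratio, eps=0.2):
    # PPO/GRPO-style hard clipping: once the ratio leaves
    # [1 - eps, 1 + eps], the surrogate is held constant and the
    # token contributes zero gradient (the "binary cutoff").
    return np.clip(ratio, 1.0 - eps, 1.0 + eps)

def soft_gaussian_gate(ratio, sigma=0.2):
    # Hypothetical differentiable soft gate (form and sigma assumed):
    # attenuates the update smoothly with a Gaussian centred at
    # ratio = 1, so off-policy tokens keep a small nonzero gradient
    # instead of being cut off outright.
    return np.exp(-((ratio - 1.0) ** 2) / (2.0 * sigma ** 2))

ratios = np.array([0.7, 1.0, 1.3])
print(hard_clip_weight(ratios))   # saturates at the clip boundaries
print(soft_gaussian_gate(ratios)) # decays smoothly, never exactly zero
```

Under this toy gate, the weight is 1 at `ratio = 1` and decays continuously as the ratio drifts, whereas the clipped surrogate's gradient switches off abruptly at the boundary.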

Tags

LLM, Reinforcement Learning, Policy Optimization, Reasoning

arXiv Categories

cs.LG cs.AI