Agent Tuning & Optimization · Relevance: 8/10

DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training

Dingwei Zhu, Zhiheng Xi, Shihan Dou, Jiahan Li, Chenhao Huang, Junjie Ye, Sixian Li, Mingxu Chai, Yuhui Wang, Yajie Yang, Ming Zhang, Jiazheng Zhang, Shichun Liu, Caishuang Huang, Yunke Zhang, Yuran Wang, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang
arXiv: 2602.05890v1 · Published: 2026-02-05 · Updated: 2026-02-05

AI Summary

DFPO learns a value-flow model of returns, improving the robustness and generalization of LLM post-training in noisy environments.

Key Contributions

  • Proposes the DFPO framework, which models a continuous value flow rather than independent quantile estimates
  • Introduces conditional risk control and consistency constraints to stabilize training
  • Demonstrates experimentally that DFPO outperforms competing methods under noisy supervision

Methodology

DFPO learns a value flow field to capture richer state information, then stabilizes training and policy optimization with conditional risk control and consistency constraints.
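For context on the baseline DFPO improves on: distributional RL methods typically fit each quantile level independently with the quantile-regression (pinball) loss. Below is a minimal NumPy sketch of that independent-quantile objective; the function name `pinball_loss` and the toy values are illustrative, not from the paper, and DFPO's flow-field parameterization itself is not reproduced here.

```python
import numpy as np

def pinball_loss(pred_quantiles, target, taus):
    """Quantile-regression (pinball) loss.

    Each quantile level tau_i is fit as an independent scalar prediction,
    which is exactly the 'isolated quantile' value modeling that DFPO's
    continuous value flow is designed to replace.
    """
    diff = target - pred_quantiles  # positive when the prediction undershoots
    # Asymmetric penalty: undershooting is weighted by tau, overshooting by (1 - tau)
    return np.mean(np.maximum(taus * diff, (taus - 1.0) * diff))

taus = np.array([0.1, 0.5, 0.9])          # quantile levels
preds = np.array([0.2, 0.5, 0.8])         # independent scalar quantile estimates
loss = pinball_loss(preds, target=0.5, taus=taus)
```

Because each quantile is trained in isolation against the same scalar target, the resulting value representation carries no cross-quantile structure, which is the limitation the abstract attributes to prior distributional RL methods.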

Original Abstract

Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. This results in rough-grained value representations that lack fine-grained conditioning on state information, struggling under complex and OOD conditions. We propose DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control), a robust distributional RL framework that models values as continuous flows across time steps. By scaling value modeling through learning of a value flow field instead of isolated quantile predictions, DFPO captures richer state information for more accurate advantage estimation. To stabilize training under noisy feedback, DFPO further integrates conditional risk control and consistency constraints along value flow trajectories. Experiments on dialogue, math reasoning, and scientific tasks show that DFPO outperforms PPO, FlowRL, and other robust baselines under noisy supervision, achieving improved training stability and generalization.
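The "conditional risk control" the abstract mentions is, in generic distributional RL terms, a risk-sensitive statistic computed over the value distribution. A common such primitive is Conditional Value-at-Risk (CVaR), the mean of the worst α-fraction of value estimates; the sketch below shows that generic primitive, with the caveat that how DFPO conditions on risk along value-flow trajectories is a paper-specific detail not reproduced here, and the function name is our own.

```python
import numpy as np

def cvar_from_quantiles(quantile_values, alpha=0.25):
    """CVaR estimate from a set of quantile samples of the value distribution.

    Sorts the estimates and averages the worst alpha-fraction, giving a
    pessimistic value that down-weights noisy optimistic outliers -- one
    standard way to make advantage estimation robust to noisy feedback.
    """
    q = np.sort(np.asarray(quantile_values, dtype=float))
    k = max(1, int(np.ceil(alpha * q.size)))  # number of worst-case samples
    return float(q[:k].mean())
```

Using a risk level like α = 0.25 trades a small amount of expected return for stability under noisy supervision, which is the trade-off the abstract targets.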

Tags

Reinforcement Learning · LLM Post-Training · Distributional RL · Robustness · Generalization

arXiv Categories

cs.LG cs.CL