Agent Tuning & Optimization relevance: 9/10

Mitigating Mismatch within Reference-based Preference Optimization

Suqin Yuan, Xingrui Yu, Jiyang Zheng, Lei Feng, Dadong Wang, Ivor Tsang, Tongliang Liu
arXiv: 2602.11902v1 Published: 2026-02-12 Updated: 2026-02-12

AI Summary

Addresses DPO's "premature satisfaction" problem on pessimistic pairs by proposing Hybrid-DPO (HyPO), which conditionally debiases the reference signal to improve alignment.

Key Contributions

  • Identifies DPO's "premature satisfaction" problem on pessimistic pairs and frames it as a concrete form of training-inference mismatch.
  • Proposes Hybrid-DPO (HyPO), a drop-in modification of DPO that uses the reference signal conditionally, mitigating premature satisfaction.
  • Shows experimentally that HyPO improves inference-aligned metrics and pairwise win rates, indicating that conditionally debiasing the reference signal strengthens direct preference alignment.

Methodology

HyPO modifies the DPO objective so that, on pessimistic pairs, the reference model's preference is treated as neutral. This strengthens the per-example learning signal while preserving DPO's objective form and computational cost.

Original Abstract

Direct Preference Optimization (DPO) has become the de facto standard for offline preference alignment of large language models, but its reliance on a reference policy introduces a critical tension. DPO weighs each update relative to a reference, which stabilizes the training by regularizing the updates within a trusted region. This reliance becomes problematic for pessimistic pairs, where the reference model prefers the rejected response. For these pairs, DPO prematurely attenuates the gradient as soon as the policy margin ($Δ_θ$) merely beats the reference margin ($Δ_{\mathrm{ref}}$) even if the policy is still wrong ($Δ_θ<0$). We name this failure premature satisfaction, which is a concrete form of the training-inference mismatch. Reference-free objectives remove this mismatch by optimizing the absolute margin, but at the cost of discarding the stabilizing signal of the reference. We mitigate this tension with Hybrid-DPO (HyPO), a drop-in modification to DPO that applies reference conditionally: HyPO behaves exactly like DPO when the reference is optimistic or neutral, and it treats the reference as neutral when it is pessimistic by replacing $Δ_θ-Δ_{\mathrm{ref}}$ with $Δ_θ-\max\{0,Δ_{\mathrm{ref}}\}$. This one-line change strictly strengthens per-example learning signals on pessimistic pairs while preserving DPO's objective form and computational cost. By conditionally debiasing the pessimistic reference signal, HyPO mitigates premature satisfaction; empirically, across preference alignment, HyPO improves inference-aligned metrics and achieves higher pairwise win rates. Our results provide evidence that direct preference alignment could be enhanced by conditionally debiasing the reference signal, rather than discarding it.
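The one-line change described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: `delta_theta` and `delta_ref` stand for the scalar policy and reference log-ratio margins $Δ_θ$ and $Δ_{\mathrm{ref}}$, and `beta` is the usual DPO temperature (value assumed here).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(delta_theta, delta_ref, beta=0.1):
    # Standard DPO per-example loss on the relative margin
    # delta_theta - delta_ref.
    return -math.log(sigmoid(beta * (delta_theta - delta_ref)))

def hypo_loss(delta_theta, delta_ref, beta=0.1):
    # HyPO: clamp a pessimistic reference margin (delta_ref < 0)
    # to zero, i.e. replace delta_ref with max(0, delta_ref).
    # When the reference is optimistic or neutral, this is exactly DPO.
    return -math.log(sigmoid(beta * (delta_theta - max(0.0, delta_ref))))
```

On a pessimistic pair (delta_ref < 0), HyPO's loss is strictly larger than DPO's, so the gradient is not prematurely attenuated even when the policy is still wrong (delta_theta < 0); on optimistic or neutral pairs the two losses coincide.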

Tags

DPO Preference Optimization Reference Policy Training-Inference Mismatch Hybrid-DPO

arXiv Categories

cs.LG cs.AI