Agent Tuning & Optimization Relevance: 9/10

How Sampling Shapes LLM Alignment: From One-Shot Optima to Iterative Dynamics

Yurong Chen, Yu He, Michael I. Jordan, Fan Yao
arXiv: 2602.12180v1 Published: 2026-02-12 Updated: 2026-02-12

AI Summary

This paper presents a theoretical analysis of how sampling methods affect LLM alignment, revealing alignment problems that sampling bias can induce.

Main Contributions

  • Proves that instance-dependent sampling can strengthen ranking guarantees
  • Shows that skewed on-policy sampling can induce excessive concentration
  • Analyzes oscillation and entropy collapse arising in iterative alignment dynamics

Methodology

Theoretical analysis within the Identity Preference Optimization and Direct Preference Optimization frameworks, combined with experimental validation.
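For context, the two objectives named above have standard forms in the literature; the notation below may differ from the paper's. Here $\sigma$ is the logistic function, $\beta$ and $\tau$ are regularization strengths, $y_w$ and $y_l$ are the preferred and dispreferred responses, and $\pi_{\mathrm{ref}}$ is the reference policy.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

$$
\mathcal{L}_{\mathrm{IPO}}(\pi_\theta) = \mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\left(\log \frac{\pi_\theta(y_w \mid x)\,\pi_{\mathrm{ref}}(y_l \mid x)}{\pi_\theta(y_l \mid x)\,\pi_{\mathrm{ref}}(y_w \mid x)} - \frac{1}{2\tau}\right)^{2}\right]
$$

Both objectives regress or push the policy's log-probability ratio (relative to the reference) on preferred versus dispreferred responses, which is why the choice of sampling distribution and reference policy matters for the resulting alignment behavior.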

Original Abstract

Standard methods for aligning large language models with human preferences learn from pairwise comparisons among sampled candidate responses and regularize toward a reference policy. Despite their effectiveness, the effects of sampling and reference choices are poorly understood theoretically. We investigate these effects through Identity Preference Optimization, a widely used preference alignment framework, and show that proper instance-dependent sampling can yield stronger ranking guarantees, while skewed on-policy sampling can induce excessive concentration under structured preferences. We then analyze iterative alignment dynamics in which the learned policy feeds back into future sampling and reference policies, reflecting a common practice of model-generated preference data. We prove that these dynamics can exhibit persistent oscillations or entropy collapse for certain parameter choices, and characterize regimes that guarantee stability. Our theoretical insights extend to Direct Preference Optimization, indicating the phenomena we captured are common to a broader class of preference-alignment methods. Experiments on real-world preference data validate our findings.
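A minimal sketch of the iterative loop the abstract describes, under toy assumptions: a categorical policy over K candidate responses, preferences drawn from a fixed Bradley-Terry model, a single IPO gradient step per round, and the learned policy fed back as the next round's sampling and reference policy. All names, hyperparameter values, and the one-step update are illustrative rather than the paper's actual setup; printing the policy entropy each round is only meant to hint at the concentration and collapse behaviors the paper analyzes.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5           # toy set of candidate responses (assumed size)
tau = 0.1       # IPO regularization strength (assumed value)
lr = 0.5        # learning rate (assumed value)
n_pairs = 2000  # preference pairs sampled per round
n_rounds = 10   # outer iterations of the alignment loop

# Fixed "true" preference probabilities P[i, j] = Pr(i preferred over j),
# generated from random utilities via a Bradley-Terry model.
util = rng.normal(size=K)
P = 1.0 / (1.0 + np.exp(-(util[:, None] - util[None, :])))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def log_softmax(z):
    m = z.max()
    return z - (m + np.log(np.sum(np.exp(z - m))))

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

logits = np.zeros(K)        # current policy parameters
ref_logits = logits.copy()  # reference policy, refreshed each round

for t in range(n_rounds):
    pi = softmax(logits)
    # On-policy sampling: draw candidate pairs from the current policy.
    a = rng.choice(K, size=n_pairs, p=pi)
    b = rng.choice(K, size=n_pairs, p=pi)
    # Label winners according to the true preference model.
    win_a = rng.random(n_pairs) < P[a, b]
    w = np.where(win_a, a, b)
    l = np.where(win_a, b, a)

    # One gradient step of the IPO objective:
    #   L = E[(h - 1/(2*tau))^2],
    #   h = (log pi(w) - log pi(l)) - (log pi_ref(w) - log pi_ref(l))
    log_pi = log_softmax(logits)
    log_ref = log_softmax(ref_logits)
    h = (log_pi[w] - log_pi[l]) - (log_ref[w] - log_ref[l])
    g = 2.0 * (h - 1.0 / (2.0 * tau))  # dL/dh for each pair
    grad = np.zeros(K)
    np.add.at(grad, w, g)    # dh/dlogit_w = +1 (softmax terms cancel)
    np.add.at(grad, l, -g)   # dh/dlogit_l = -1
    logits -= lr * grad / n_pairs

    # Feed the learned policy back as the next reference policy.
    ref_logits = logits.copy()
    print(f"round {t}: policy entropy = {entropy(softmax(logits)):.3f}")
```

In this toy loop the reference is replaced by the latest policy every round, so the log-ratio target of 1/(2*tau) is pursued against a moving baseline; depending on tau and the learning rate, the entropy trace can either stabilize or steadily shrink, which is the kind of regime distinction the paper characterizes.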

Tags

LLM Alignment · Sampling Methods · Preference Optimization · Theoretical Analysis

arXiv Categories

cs.LG cs.GT