Agent Tuning & Optimization  Relevance: 9/10

Aligning to Illusions: Choice Blindness in Human and AI Feedback

Wenbin Wu
arXiv: 2603.08412v1  Published: 2026-03-09  Updated: 2026-03-09

AI Summary

A "choice blindness" effect is present in both human and AI feedback, distorting the RLHF training signal in ways that standard evaluation metrics fail to detect.

Key Contributions

  • Reveals that humans exhibit choice blindness when evaluating preferences
  • Finds that LLM preference judgments rely on shallow text matching rather than genuine self-monitoring
  • Demonstrates that RLHF reward signals are easily corrupted by noise, and that standard metrics cannot effectively detect the corruption

Methodology

Through a human experiment, LLM-judge experiments, and reward-signal corruption experiments, the work studies choice blindness across the preference-feedback pipeline.
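The reward-signal corruption experiment can be illustrated with a minimal sketch: flip a fraction of preference labels (swap the "chosen" and "rejected" responses) before reward-model training. This is an assumption-laden illustration, not the authors' code; `corrupt_preferences` is a hypothetical helper.

```python
import random

def corrupt_preferences(pairs, corruption_rate, seed=0):
    """Flip a fraction of preference labels to simulate annotator noise.

    Each pair is (chosen, rejected); a corrupted pair swaps the two,
    a simple stand-in for the label-corruption setup described here.
    """
    rng = random.Random(seed)
    corrupted = []
    for chosen, rejected in pairs:
        if rng.random() < corruption_rate:
            corrupted.append((rejected, chosen))  # swapped label
        else:
            corrupted.append((chosen, rejected))  # label kept intact
    return corrupted
```

A reward model trained on the corrupted pairs can then be compared against one trained on clean labels at increasing corruption rates, which is the dose-response design the abstract refers to.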

Original Abstract

Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.
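The Best-of-N evaluation mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: `true_quality` and `noisy_proxy` are hypothetical stand-ins, with the noisy proxy modeling a reward model trained on heavily corrupted labels.

```python
import random

def best_of_n(candidates, reward_fn):
    # Select the candidate that the (possibly corrupted) proxy reward scores highest.
    return max(candidates, key=reward_fn)

rng = random.Random(0)
# Hypothetical candidates, each represented by its true quality score.
candidates = [rng.random() for _ in range(16)]
true_quality = lambda c: c          # oracle: quality is known exactly
noisy_proxy = lambda c: rng.random()  # uninformative proxy (~50% label corruption)

picked = best_of_n(candidates, noisy_proxy)
# With an uninformative proxy, the selected candidate is no better in
# expectation than a random draw, matching the 50%-corruption finding:
# reward-guided selection collapses to random sampling.
baseline = rng.choice(candidates)
```

Under the oracle reward, `best_of_n` recovers the highest-quality candidate; under the noisy proxy it does not, even though the proxy itself can keep reporting high scores for its selections.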

Tags

RLHF  Choice Blindness  Preference Learning  LLM Evaluation

arXiv Categories

cs.CL cs.AI