Aligning to Illusions: Choice Blindness in Human and AI Feedback
AI Summary
Both human and AI feedback exhibit "choice blindness," which distorts the RLHF training signal in ways that standard evaluation metrics struggle to detect.
Key Contributions
- Reveals that human annotators exhibit choice blindness when evaluating preferences
- Finds that LLM preference judgments rely on shallow text matching rather than genuine self-monitoring
- Demonstrates that the RLHF reward signal is easily corrupted by noise, and that standard metrics fail to detect the corruption
Methodology
Choice blindness in the preference-feedback process is studied through three experiments: a human choice blindness study, LLM judge experiments, and reward-signal corruption experiments.
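To make the reward-signal corruption setup concrete, the sketch below flips a fraction of preference labels and measures standard pairwise accuracy. It assumes preference data stored as (chosen, rejected) text pairs and a scalar reward function; the helper names are illustrative, not the paper's code.

```python
import random

def corrupt_preferences(pairs, corruption_rate, seed=0):
    """Swap 'chosen' and 'rejected' for a random fraction of preference pairs.

    pairs: list of (chosen_text, rejected_text) tuples.
    corruption_rate: fraction of pairs whose labels are flipped
        (the paper's dose-response sweep covers roughly 1/6 to 1/3 and beyond).
    """
    rng = random.Random(seed)
    corrupted = []
    for chosen, rejected in pairs:
        if rng.random() < corruption_rate:
            corrupted.append((rejected, chosen))  # flipped label
        else:
            corrupted.append((chosen, rejected))
    return corrupted

def pairwise_accuracy(reward_fn, pairs):
    """Fraction of pairs where the reward model scores 'chosen' above 'rejected'."""
    hits = sum(reward_fn(chosen) > reward_fn(rejected) for chosen, rejected in pairs)
    return hits / len(pairs)
```

Sweeping `corruption_rate` and re-training the reward model at each level is what the dose-response experiment does; the reported finding is that pairwise accuracy stays nearly flat even as the underlying reward signal degrades.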
Original Abstract
Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.
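For reference, the Best-of-N evaluation mentioned in the abstract amounts to sampling n completions and keeping the one the proxy reward model scores highest. A minimal sketch follows, with function and parameter names as illustrative assumptions rather than the paper's implementation:

```python
def best_of_n(prompt, generate_fn, proxy_reward_fn, n=16):
    """Best-of-N selection: sample n completions for a prompt and return the one
    the proxy reward model scores highest, along with its proxy score."""
    candidates = [generate_fn(prompt) for _ in range(n)]
    scores = [proxy_reward_fn(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]
```

The failure mode described in the abstract corresponds to the proxy score of the selected completion continuing to rise under heavy label corruption, while an independent judgment finds it no better than a randomly sampled candidate.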