Aligning to Illusions: Choice Blindness in Human and AI Feedback
AI Summary
Both human and AI feedback exhibit "choice blindness," which distorts the RLHF training signal in ways that standard evaluation metrics struggle to detect.
Key Contributions
- Reveals that human annotators exhibit choice blindness when evaluating preferences
- Finds that LLM preference judgments rely on shallow text matching rather than genuine self-monitoring
- Demonstrates that the RLHF reward signal is easily corrupted by noise, and that standard metrics fail to detect the corruption
Methodology
Choice blindness in the preference-feedback process is studied through three experiments: a human choice blindness study, LLM judge experiments, and reward-signal corruption experiments.
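To make the reward-signal corruption setup concrete, the sketch below flips a fraction of preference labels and measures standard pairwise accuracy. It assumes preference data stored as (chosen, rejected) text pairs and a scalar reward function; the helper names are illustrative, not the paper's code.

```python
import random

def corrupt_preferences(pairs, corruption_rate, seed=0):
    """Swap 'chosen' and 'rejected' for a random fraction of preference pairs.

    pairs: list of (chosen_text, rejected_text) tuples.
    corruption_rate: fraction of pairs whose labels are flipped
        (the paper's dose-response sweep covers roughly 1/6 to 1/3 and beyond).
    """
    rng = random.Random(seed)
    corrupted = []
    for chosen, rejected in pairs:
        if rng.random() < corruption_rate:
            corrupted.append((rejected, chosen))  # flipped label
        else:
            corrupted.append((chosen, rejected))
    return corrupted

def pairwise_accuracy(reward_fn, pairs):
    """Fraction of pairs where the reward model scores 'chosen' above 'rejected'."""
    hits = sum(reward_fn(chosen) > reward_fn(rejected) for chosen, rejected in pairs)
    return hits / len(pairs)
```

Sweeping `corruption_rate` and re-training the reward model at each level is what the dose-response experiment does; the reported finding is that pairwise accuracy stays nearly flat even as the underlying reward signal degrades.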
Original Abstract
Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.
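For reference, the Best-of-N evaluation mentioned in the abstract amounts to sampling n completions and keeping the one the proxy reward model scores highest. A minimal sketch follows, with function and parameter names as illustrative assumptions rather than the paper's implementation:

```python
def best_of_n(prompt, generate_fn, proxy_reward_fn, n=16):
    """Best-of-N selection: sample n completions for a prompt and return the one
    the proxy reward model scores highest, along with its proxy score."""
    candidates = [generate_fn(prompt) for _ in range(n)]
    scores = [proxy_reward_fn(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]
```

The failure mode described in the abstract corresponds to the proxy score of the selected completion continuing to rise under heavy label corruption, while an independent judgment finds it no better than a randomly sampled candidate.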