Inference-Time Reasoning Selectively Reduces Implicit Social Bias in Large Language Models
AI 摘要
推理能力能在一定程度上减少大语言模型中内隐的社会偏见。
主要贡献
- 发现推理能显著减少LLM的内隐社会偏见
- 揭示了这种减少偏见效应的领域特异性 (仅在社会偏见领域)
- 强调了认知科学和心理学理论在AI评估中的价值
方法论
采用类似内隐联想测验(IAT)的方法评估LLM的内隐偏见,比较有无推理能力时的偏见程度。
原文摘要
Drawing on constructs from psychology, prior work has identified a distinction between explicit and implicit bias in large language models (LLMs). While many LLMs undergo post-training alignment and safety procedures to avoid expressions of explicit social bias, they still exhibit significant implicit biases on indirect tasks resembling the Implicit Association Test (IAT). Recent work has further shown that inference-time reasoning can impair LLM performance on tasks that rely on implicit statistical learning. Motivated by a theoretical link between implicit associations and statistical learning in human cognition, we examine how reasoning-enabled inference affects implicit bias in LLMs. We find that enabling reasoning significantly reduces measured implicit bias on an IAT-style evaluation for some model classes across fifteen stereotype topics. This effect appears specific to social bias domains, as we observe no corresponding reduction for non-social implicit associations. As reasoning is increasingly enabled by default in deployed LLMs, these findings suggest that it can meaningfully alter fairness evaluation outcomes in some systems, while also raising questions about how alignment procedures interact with inference-time reasoning to drive variation in bias reduction across model types. More broadly, this work highlights how theory from cognitive science and psychology can complement AI evaluation research by providing methodological and interpretive frameworks that reveal new insights into model behavior.