LLM Reasoning 相关度: 8/10

Intra-Fairness Dynamics: The Bias Spillover Effect in Targeted LLM Alignment

Eva Paraschou, Line Harder Clemmensen, Sneha Das
arXiv: 2602.16438v1 发布: 2026-02-18 更新: 2026-02-18

AI 摘要

研究表明,LLM公平性对齐在单一属性上优化可能导致其他属性的偏差加剧,存在偏差溢出效应。

主要贡献

  • 揭示了LLM对齐中目标属性的公平性优化可能导致其他属性的偏差溢出效应
  • 通过实验证明了在模糊语境下,改善一个属性的公平性可能恶化其他属性的不公平性
  • 强调了在LLM公平性评估中进行语境感知和多属性评估的重要性

方法论

使用直接偏好优化(DPO)对三个LLM进行性别对齐,并使用BBQ基准在模糊和非模糊语境下评估公平性。

原文摘要

Conventional large language model (LLM) fairness alignment largely focuses on mitigating bias along single sensitive attributes, overlooking fairness as an inherently multidimensional and context-specific value. This approach risks creating systems that achieve narrow fairness metrics while exacerbating disparities along untargeted attributes, a phenomenon known as bias spillover. While extensively studied in machine learning, bias spillover remains critically underexplored in LLM alignment. In this work, we investigate how targeted gender alignment affects fairness across nine sensitive attributes in three state-of-the-art LLMs (Mistral 7B, Llama 3.1 8B, Qwen 2.5 7B). Using Direct Preference Optimization and the BBQ benchmark, we evaluate fairness under ambiguous and disambiguous contexts. Our findings reveal noticeable bias spillover: while aggregate results show improvements, context-aware analysis exposes significant degradations in ambiguous contexts, particularly for physical appearance ($p< 0.001$ across all models), sexual orientation, and disability status. We demonstrate that improving fairness along one attribute can inadvertently worsen disparities in others under uncertainty, highlighting the necessity of context-aware, multi-attribute fairness evaluation frameworks.

标签

LLM Fairness Bias Spillover Alignment

arXiv 分类

cs.LG cs.AI