LLM Reasoning relevance: 8/10

Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study

Kensuke Okada, Yui Furukawa, Kyosuke Bunji
arXiv: 2602.17262v1 Published: 2026-02-19 Updated: 2026-02-19

AI Summary

Proposes a method to quantify and mitigate social desirability bias in questionnaire-based evaluation of LLMs, and uses a forced-choice inventory to reduce that bias.

Main Contributions

  • Proposed a psychometric framework for quantifying socially desirable responding (SDR) in LLMs.
  • Constructed a graded forced-choice (GFC) Big Five inventory with desirability-matched item pairs.
  • Demonstrated that the GFC inventory substantially attenuates SDR while largely preserving recovery of target persona profiles.

Methodology

The same inventory is administered under HONEST and FAKE-GOOD instructions; latent trait scores are estimated via item response theory (IRT), and SDR is computed as a direction-corrected standardized effect size between the two conditions. The GFC inventory is then built by selecting cross-domain item pairs via constrained optimization so that the paired statements are matched on social desirability.
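The SDR score described above can be sketched as a direction-corrected Cohen's d between latent scores under the two instruction conditions. This is a minimal illustration, not the paper's exact estimator; the function name and the `desirable_direction` parameter are assumptions for the sketch.

```python
import numpy as np

def sdr_effect_size(theta_honest, theta_fake_good, desirable_direction=+1):
    """Direction-corrected standardized effect size for SDR (sketch).

    theta_honest, theta_fake_good: IRT-estimated latent scores for the
    same trait under HONEST and FAKE-GOOD instructions.
    desirable_direction: +1 if higher trait scores are socially desirable
    (e.g. Conscientiousness), -1 if lower scores are (e.g. Neuroticism).
    """
    th = np.asarray(theta_honest, dtype=float)
    tf = np.asarray(theta_fake_good, dtype=float)
    n1, n2 = len(th), len(tf)
    # Pooled standard deviation, as in Cohen's d for two groups.
    pooled_sd = np.sqrt(((n1 - 1) * th.var(ddof=1) + (n2 - 1) * tf.var(ddof=1))
                        / (n1 + n2 - 2))
    # Positive values indicate a shift toward the socially desirable pole.
    return desirable_direction * (tf.mean() - th.mean()) / pooled_sd
```

Direction correction makes effect sizes comparable across traits whose desirable poles differ, which is what allows aggregation across Big Five dimensions.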

Original Abstract

Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers, a form of socially desirable responding (SDR), biasing questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs. To quantify SDR, the same inventory is administered under HONEST versus FAKE-GOOD instructions, and SDR is computed as a direction-corrected standardized effect size from item response theory (IRT)-estimated latent scores. This enables comparisons across constructs and response formats, as well as against human instructed-faking benchmarks. For mitigation, we construct a graded forced-choice (GFC) Big Five inventory by selecting 30 cross-domain pairs from an item pool via constrained optimization to match desirability. Across nine instruction-tuned LLMs evaluated on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas desirability-matched GFC substantially attenuates SDR while largely preserving the recovery of the intended persona profiles. These results highlight a model-dependent SDR-recovery trade-off and motivate SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.
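The desirability-matched pairing step in the abstract can be illustrated with a simple greedy selection: pick cross-domain item pairs whose desirability ratings are closest, each item used at most once. The paper uses constrained optimization; this greedy version is only an approximation of that idea, and the function name, tolerance, and data layout are assumptions for the sketch.

```python
import itertools

def match_pairs(items, n_pairs=30, tol=0.2):
    """Greedy sketch of desirability-matched cross-domain pairing.

    items: list of (item_id, domain, desirability_rating) tuples.
    Selects up to n_pairs pairs of items from different Big Five domains
    whose desirability ratings differ by at most `tol`, best-matched first,
    using each item at most once.
    """
    # All cross-domain candidate pairs within tolerance, closest first.
    candidates = sorted(
        ((a, b) for a, b in itertools.combinations(items, 2)
         if a[1] != b[1] and abs(a[2] - b[2]) <= tol),
        key=lambda p: abs(p[0][2] - p[1][2]))
    used, pairs = set(), []
    for a, b in candidates:
        if a[0] in used or b[0] in used:
            continue  # each item may appear in only one pair
        pairs.append((a[0], b[0]))
        used.update((a[0], b[0]))
        if len(pairs) == n_pairs:
            break
    return pairs
```

Because both statements in a pair are about equally desirable, a fake-good strategy no longer identifies an obviously "better" answer, which is what drives the attenuation of SDR.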

Tags

LLM Psychometrics · Social Desirability Bias · Questionnaire Evaluation

arXiv Categories

cs.CL stat.ME