LLM Reasoning relevance: 9/10

Evaluating the Presence of Sex Bias in Clinical Reasoning by Large Language Models

Isabel Tsintsiper, Sheng Wong, Beth Albert, Shaun P Brennecke, Gabriel Davis Jones
arXiv: 2602.04392v1 Published: 2026-02-04 Updated: 2026-02-04

AI Summary

Evaluates sex bias in clinical reasoning by large language models and finds stable, model-specific sex-assignment skews across models.

Key Contributions

  • Systematically evaluates sex bias in LLM clinical reasoning
  • Finds that different LLMs exhibit stable, model-specific sex-assignment skews
  • Highlights the need for caution and continued human oversight when deploying general-purpose LLMs in healthcare

Methodology

Three experiments were run against four general-purpose LLMs using 50 clinician-authored clinical vignettes (spanning 44 specialties, with sex non-informative to the initial diagnostic pathway) to evaluate each model's sex-assignment skew; a sketch of the experimental loop follows.
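The excerpt does not include the authors' prompts or evaluation harness, so the following is a minimal, hypothetical Python sketch of one such experiment: each vignette is posed to a model and the sex it assigns (or its abstention) is tallied. The prompt wording, the `ask_model` callable, and the label-extraction heuristic are all illustrative assumptions, not the paper's actual instrument.

```python
from collections import Counter
from typing import Callable, Iterable

def assigned_sex(answer: str) -> str:
    """Crudely extract a sex label from a free-text model answer.

    Check "female" first: the substring "male" also occurs inside
    "female", so the order of these tests matters.
    """
    text = answer.lower()
    if "female" in text:
        return "female"
    if "male" in text:
        return "male"
    return "abstained"

def run_experiment(vignettes: Iterable[str],
                   ask_model: Callable[[str], str]) -> Counter:
    """Pose each vignette to the model and tally the sex it assigns."""
    counts: Counter = Counter()
    for vignette in vignettes:
        # Hypothetical prompt wording; the paper's instrument is not shown.
        prompt = (f"{vignette}\n\nState the patient's most likely sex "
                  "before giving a differential diagnosis.")
        counts[assigned_sex(ask_model(prompt))] += 1
    return counts

# Dummy model stand-in so the sketch runs without an API key; in the
# paper's setup this would be a call to one of the four evaluated LLMs
# at fixed decoding settings (e.g. temperature 0.5).
demo = run_experiment(
    ["A 54-year-old presents with acute chest pain radiating to the jaw."],
    lambda prompt: "The patient is most likely female. Differential: ...",
)
print(demo)  # Counter({'female': 1})
```

Repeating this loop per model and aggregating the counts would yield sex-assignment proportions of the kind reported in the abstract.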

Original Abstract

Large language models (LLMs) are increasingly embedded in healthcare workflows for documentation, education, and clinical decision support. However, these systems are trained on large text corpora that encode existing biases, including sex disparities in diagnosis and treatment, raising concerns that such patterns may be reproduced or amplified. We systematically examined whether contemporary LLMs exhibit sex-specific biases in clinical reasoning and how model configuration influences these behaviours. We conducted three experiments using 50 clinician-authored vignettes spanning 44 specialties in which sex was non-informative to the initial diagnostic pathway, evaluating four general-purpose LLMs: ChatGPT (gpt-4o-mini), Claude 3.7 Sonnet, Gemini 2.0 Flash and DeepSeek (deepseek-chat). All models demonstrated significant sex-assignment skew, with predicted sex differing by model. At temperature 0.5, ChatGPT assigned female sex in 70% of cases (95% CI 0.66-0.75), DeepSeek in 61% (0.57-0.65) and Claude in 59% (0.55-0.63), whereas Gemini showed a male skew, assigning female sex in 36% of cases (0.32-0.41). Contemporary LLMs exhibit stable, model-specific sex biases in clinical reasoning. Permitting abstention reduces explicit labelling but does not eliminate downstream diagnostic differences. Safe clinical integration requires conservative and documented configuration, specialty-level clinical data auditing, and continued human oversight when deploying general-purpose models in healthcare settings.
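The abstract reports each proportion with a 95% confidence interval (e.g. ChatGPT: 70%, CI 0.66-0.75), but does not state the interval method or the number of generations behind each estimate. As a hedged illustration, a Wilson score interval over a hypothetical 400 trials lands close to the reported range:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Illustration only: the abstract does not give the trial count behind
# each proportion; 400 is a hypothetical figure for which 70% female
# assignments yields an interval near the reported 0.66-0.75.
low, high = wilson_ci(280, 400)
print(f"70% of 400 trials -> 95% CI ({low:.2f}, {high:.2f})")
```

The interval method is an assumption; a normal-approximation interval over the same hypothetical counts would give a similar range.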

Tags

LLM, clinical reasoning, sex bias, healthcare

arXiv Categories

cs.CL