This human study did not involve human subjects: Validating LLM simulations as behavioral evidence
AI Summary
The paper examines the validity of using LLMs to simulate human behavior, contrasting two strategies: heuristic approaches and statistical calibration.
Key contributions
- Contrasts two strategies for LLM simulation: heuristic approaches and statistical calibration
- Clarifies the conditions under which each strategy suits exploratory versus confirmatory research
- Emphasizes the importance of assessing how well LLMs approximate the relevant human populations
Methodology
The paper contrasts two LLM simulation strategies: heuristic approaches (prompt engineering and related repair strategies) and statistical calibration (combining auxiliary human data with statistical adjustments).
Original abstract
A growing literature uses large language models (LLMs) as synthetic participants to generate cost-effective and nearly instantaneous responses in social science experiments. However, there is limited guidance on when such simulations support valid inference about human behavior. We contrast two strategies for obtaining valid estimates of causal effects and clarify the assumptions under which each is suitable for exploratory versus confirmatory research. Heuristic approaches seek to establish that simulated and observed human behavior are interchangeable through prompt engineering, model fine-tuning, and other repair strategies designed to reduce LLM-induced inaccuracies. While useful for many exploratory tasks, heuristic approaches lack the formal statistical guarantees typically required for confirmatory research. In contrast, statistical calibration combines auxiliary human data with statistical adjustments to account for discrepancies between observed and simulated responses. Under explicit assumptions, statistical calibration preserves validity and provides more precise estimates of causal effects at lower cost than experiments that rely solely on human participants. Yet the potential of both approaches depends on how well LLMs approximate the relevant populations. We consider what opportunities are overlooked when researchers focus myopically on substituting LLMs for human participants in a study.
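The abstract does not specify the calibration estimator. As one concrete illustration of combining a large simulated sample with auxiliary human data, the sketch below implements a prediction-powered-inference-style correction: a treatment effect estimated from cheap LLM simulations is adjusted by the simulation's bias as measured on a small paired human sample. All function names, data, and the choice of estimator here are hypothetical, not taken from the paper.

```python
import numpy as np

def calibrated_effect(sim_treat, sim_ctrl,
                      human_treat, human_ctrl,
                      sim_on_human_treat, sim_on_human_ctrl):
    """Prediction-powered-style estimate of an average treatment effect.

    sim_treat / sim_ctrl: large arrays of LLM-simulated outcomes.
    human_treat / human_ctrl: outcomes from a small auxiliary human sample.
    sim_on_human_*: LLM-simulated outcomes for the same human units,
        used to estimate the simulation's bias (the "rectifier").
    """
    # Cheap but possibly biased effect from the large simulated sample.
    sim_effect = sim_treat.mean() - sim_ctrl.mean()
    # Bias of the simulation, estimated on the paired human subsample.
    human_effect = human_treat.mean() - human_ctrl.mean()
    sim_effect_small = sim_on_human_treat.mean() - sim_on_human_ctrl.mean()
    rectifier = human_effect - sim_effect_small
    # Calibrated estimate: simulated effect corrected by the measured bias.
    return sim_effect + rectifier

# Toy data: the LLM systematically understates a true effect of 1.0.
rng = np.random.default_rng(0)
n_sim, n_human = 10_000, 200
sim_treat = rng.normal(0.6, 1.0, n_sim)      # biased simulated outcomes
sim_ctrl = rng.normal(0.0, 1.0, n_sim)
human_treat = rng.normal(1.0, 1.0, n_human)  # real human outcomes
human_ctrl = rng.normal(0.0, 1.0, n_human)
sim_on_human_treat = rng.normal(0.6, 1.0, n_human)
sim_on_human_ctrl = rng.normal(0.0, 1.0, n_human)

print(calibrated_effect(sim_treat, sim_ctrl, human_treat, human_ctrl,
                        sim_on_human_treat, sim_on_human_ctrl))
```

As the abstract emphasizes, the validity of any such estimate rests on explicit assumptions, in particular how the auxiliary human sample is drawn and how well the LLM approximates the relevant population.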