LLM Reasoning relevance: 9/10

InterveneBench: Benchmarking LLMs for Intervention Reasoning and Causal Study Design in Real Social Systems

Shaojie Shi, Zhengyu Shi, Lingran Zheng, Xinyu Su, Anna Xie, Bohao Lv, Rui Xu, Zijian Chen, Zhichao Chen, Guolei Liu, Naifu Zhang, Mingjian Dong, Zhuo Quan, Bohao Chen, Teqi Hao, Yuan Qi, Yinghui Xu, Libo Wu
arXiv: 2603.15542v1 Published: 2026-03-16 Updated: 2026-03-16

AI Summary

InterveneBench benchmarks LLMs' ability to reason about interventions and design causal studies in real social systems. It finds that existing LLMs perform poorly on this task and proposes the STRIDES framework in response.

Main Contributions

  • Proposed InterveneBench, a benchmark for evaluating LLMs' capabilities in social science intervention reasoning
  • Found that existing LLMs perform poorly on InterveneBench
  • Proposed STRIDES, a multi-agent framework that achieves significant performance gains on InterveneBench

Methodology

Constructs a benchmark of 744 social science studies to evaluate LLMs' reasoning about intervention policies and identification assumptions, and proposes STRIDES, a multi-agent framework, to improve performance.
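As an illustration only, a benchmark of this kind can be thought of as pairs of (study setting, gold identification strategy) scored against model predictions. The sketch below is a toy evaluation loop under that assumption; the instance schema, the strategy labels, and the `predict_strategy` stand-in are all hypothetical and do not reflect InterveneBench's actual data format or scoring code.

```python
# Toy sketch of an InterveneBench-style evaluation loop.
# Schema, labels, and predict_strategy() are illustrative assumptions.

# Each instance: a policy-intervention description plus the gold
# identification strategy used in the original empirical study.
instances = [
    {"setting": "Minimum-wage increase in one state, neighboring state unaffected",
     "gold_strategy": "difference-in-differences"},
    {"setting": "Scholarship awarded above a fixed test-score cutoff",
     "gold_strategy": "regression discontinuity"},
    {"setting": "Distance to college used to shift education levels",
     "gold_strategy": "instrumental variables"},
]

def predict_strategy(setting: str) -> str:
    """Stand-in for querying an LLM; here, a trivial keyword heuristic."""
    s = setting.lower()
    if "cutoff" in s:
        return "regression discontinuity"
    if "neighboring" in s or "unaffected" in s:
        return "difference-in-differences"
    return "instrumental variables"

def accuracy(instances) -> float:
    """Fraction of instances where the predicted strategy matches the gold label."""
    correct = sum(
        predict_strategy(inst["setting"]) == inst["gold_strategy"]
        for inst in instances
    )
    return correct / len(instances)

print(f"accuracy = {accuracy(instances):.2f}")
```

A real harness would replace `predict_strategy` with an LLM call and would also need to score free-form reasoning about assumptions, not just a strategy label.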

Original Abstract

Causal inference in social science relies on end-to-end, intervention-centered research-design reasoning grounded in real-world policy interventions, but current benchmarks fail to evaluate this capability of large language models (LLMs). We present InterveneBench, a benchmark designed to assess such reasoning in realistic social settings. Each instance in InterveneBench is derived from an empirical social science study and requires models to reason about policy interventions and identification assumptions without access to predefined causal graphs or structural equations. InterveneBench comprises 744 peer-reviewed studies across diverse policy domains. Experimental results show that state-of-the-art LLMs struggle under this setting. To address this limitation, we further propose a multi-agent framework, STRIDES. It achieves significant performance improvements over state-of-the-art reasoning models. Our code and data are available at https://github.com/Sii-yuning/STRIDES.

Tags

causal inference, LLM benchmark, social science, intervention

arXiv Categories

cs.CY cs.AI