AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue
AI Summary
The AIDG framework evaluates the asymmetry between information extraction and information containment in LLMs' multi-turn dialogue, revealing their reasoning bottlenecks.
Key Contributions
- Proposes AIDG, an evaluation framework for assessing the strategic reasoning capabilities of LLMs
- Designs two tasks, AIDG-I and AIDG-II, measuring social deduction and constraint satisfaction respectively
- Finds that LLMs are better at information containment than information extraction, revealing a capability asymmetry
Methodology
Designs AIDG, a game-theoretic framework comprising two tasks, social deduction and "20 Questions". Through multi-turn dialogue with LLMs, it analyzes the gap between their information-extraction and information-containment capabilities.
Original Abstract
Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions. We introduce AIDG (Adversarial Information Deduction Game), a game-theoretic framework that probes the asymmetry between information extraction (active deduction) and information containment (state maintenance) in dialogue. We propose two complementary tasks: AIDG-I, measuring pragmatic strategy in social deduction, and AIDG-II, measuring constraint satisfaction in a structured "20 Questions" setting. Across 439 games with six frontier LLMs, we observe a clear capability asymmetry: models perform substantially better at containment than deduction, with a 350 Elo advantage on defense (Cohen's d = 5.47). We identify two bottlenecks driving this gap: (1) Information Dynamics, where confirmation strategies are 7.75x more effective than blind deduction (p < 0.00001), and (2) Constraint Adherence, where instruction-following degrades under conversational load, accounting for 41.3% of deductive failures. These findings suggest that while LLMs excel at local defensive coherence, they struggle with the global state tracking required for strategic inquiry.