AI Agents Relevance: 9/10

Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA

John Ray B. Martinez
arXiv: 2603.24481v1 Published: 2026-03-25 Updated: 2026-03-25

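AI Summary

Multi-agent reasoning combined with consistency verification substantially improves uncertainty calibration in medical multiple-choice question answering.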

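Main Contributions

  • Proposes a multi-agent framework for medical question answering that uses domain-specialist agents to improve performance.
  • Introduces Two-Phase Verification, which calibrates confidence by assessing the internal consistency of each diagnosis.
  • Shows experimentally that the method reduces ECE by 49-74% across the four evaluation settings, making the uncertainty estimates more reliable.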

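Methodology

Four specialist agents (respiratory, cardiology, neurology, gastroenterology) reason independently with Qwen2.5-7B-Instruct. A two-phase self-verification step yields a Specialist Confidence Score (S-score) for each diagnosis, and the S-scores weight the fusion that selects the final answer and its reported confidence.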
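The fusion step can be illustrated with a minimal sketch. This card does not give the paper's actual S-score formula or fusion rule, so the code below assumes a simple scheme: each specialist's answer is weighted by its S-score (taken to lie in [0, 1]), the answer with the largest total weight wins, and the reported confidence is that answer's share of the total weight. The function name `fuse_answers`, the tuple layout, and the zero-weight fallback are all hypothetical.

```python
from collections import defaultdict


def fuse_answers(votes):
    """Fuse specialist answers by S-score weight (illustrative assumption only).

    votes: list of (specialist_name, answer, s_score) tuples, s_score in [0, 1].
    Returns (final_answer, confidence).
    """
    if not votes:
        raise ValueError("need at least one specialist vote")

    # Accumulate S-score mass per candidate answer.
    weight_by_answer = defaultdict(float)
    for _specialist, answer, s_score in votes:
        weight_by_answer[answer] += s_score

    total = sum(weight_by_answer.values())
    if total == 0:
        # Assumed fallback: no diagnosis passed verification, so return the
        # first proposed answer with zero confidence (a signal to defer).
        return votes[0][1], 0.0

    best = max(weight_by_answer, key=weight_by_answer.get)
    # Confidence is the winning answer's share of all S-score mass.
    return best, weight_by_answer[best] / total


votes = [
    ("respiratory", "B", 0.9),
    ("cardiology", "B", 0.6),
    ("neurology", "C", 0.5),
    ("gastroenterology", "A", 0.0),
]
answer, confidence = fuse_answers(votes)
```

Under this assumed scheme, disagreement among specialists or low S-scores lowers the reported confidence, which is the kind of signal a deferral policy could threshold on.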

Original Abstract

Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis is then subjected to a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate across four experimental settings, covering 100-question and 250-question high-disagreement subsets of both MedQA-USMLE and MedMCQA. Calibration improvement is the central finding, with ECE reduced by 49-74% across all four settings, including the harder MedMCQA benchmark where these gains persist even when absolute accuracy is constrained by knowledge-intensive recall demands. On MedQA-250, the full system achieves ECE = 0.091 (74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Ablation analysis identifies Two-Phase Verification as the primary calibration driver and multi-agent reasoning as the primary accuracy driver. These results establish that consistency-based verification produces more reliable uncertainty estimates across diverse medical question types, providing a practical confidence signal for deferral in safety-critical clinical AI applications.
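For readers unfamiliar with the headline metric, the sketch below shows one common way to compute Expected Calibration Error (ECE): predictions are grouped into equal-width confidence bins, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The bin count (10) and the binning convention are assumptions, since the abstract does not state which variant the paper uses, and the function name is illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (a common variant; the paper's exact setup is not stated here).

    confidences: predicted confidences in [0, 1].
    correct: booleans, True where the prediction was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Bins are [k/n_bins, (k+1)/n_bins); a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(o for _, o in items) / len(items)
        # Weight each bin's confidence/accuracy gap by its share of the samples.
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece


# An overconfident model: 95% confidence but only half the answers are right.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, False, False]
)
```

If the 74.4% figure for MedQA-250 is a relative reduction, that is (baseline - new) / baseline, then ECE = 0.091 implies a single-specialist baseline of roughly 0.091 / (1 - 0.744) ≈ 0.36.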

Tags

Medical QA, Multi-Agent, Uncertainty Calibration, Consistency Verification

arXiv Categories

cs.AI cs.CL cs.LG