Calibrated Confidence Expression for Radiology Report Generation
AI Summary
ConRad fine-tunes medical LVLMs with reinforcement learning to produce calibrated confidence expressions, improving the safety of radiology report generation.
Key Contributions
- Introduces the ConRad framework to improve confidence calibration for radiology reports
- Trains with the GRPO algorithm using reward functions based on the logarithmic scoring rule
- Clinical evaluation shows that ConRad's report-level scores align with clinicians' judgment
Methodology
Medical LVLMs are fine-tuned with the GRPO algorithm to output calibrated confidence estimates, using the logarithmic scoring rule as the reward function.
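The logarithmic scoring rule is a proper scoring rule: a model maximizes its expected reward only by reporting its true probability of being correct, which is what makes it suitable as a calibration-incentivizing reward. A minimal sketch of such a reward (function name and clipping constant are illustrative assumptions, not ConRad's exact implementation):

```python
import math

def log_score_reward(confidence: float, correct: bool, eps: float = 1e-6) -> float:
    """Logarithmic scoring rule: log(p) if the report/claim is correct,
    log(1 - p) otherwise. Illustrative sketch, not ConRad's actual code."""
    # Clip to avoid log(0) for extreme verbalized confidences.
    p = min(max(confidence, eps), 1.0 - eps)
    return math.log(p) if correct else math.log(1.0 - p)
```

Because the rule is proper, a report that is correct 70% of the time earns the highest expected reward when the stated confidence is exactly 0.7; over- or under-stating confidence is penalized in expectation, which is the incentive GRPO training exploits.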
Original Abstract
Safe deployment of Large Vision-Language Models (LVLMs) in radiology report generation requires not only accurate predictions but also clinically interpretable indicators of when outputs should be thoroughly reviewed, enabling selective radiologist verification and reducing the risk of hallucinated findings influencing clinical decisions. One intuitive approach to this is verbalized confidence, where the model explicitly states its certainty. However, current state-of-the-art language models are often overconfident, and research on calibration in multimodal settings such as radiology report generation is limited. To address this gap, we introduce ConRad (Confidence Calibration for Radiology Reports), a reinforcement learning framework for fine-tuning medical LVLMs to produce calibrated verbalized confidence estimates alongside radiology reports. We study two settings: a single report-level confidence score and a sentence-level variant assigning a confidence to each claim. Both are trained using the GRPO algorithm with reward functions based on the logarithmic scoring rule, which incentivizes truthful self-assessment by penalizing miscalibration and guarantees optimal calibration under reward maximization. Experimentally, ConRad substantially improves calibration and outperforms competing methods. In a clinical evaluation we show that ConRad's report-level scores are well aligned with clinicians' judgment. By highlighting full reports or low-confidence statements for targeted review, ConRad can support safer clinical integration of AI assistance for report generation.