Multimodal Learning Relevance: 9/10

Ran Score: a LLM-based Evaluation Score for Radiology Report Generation

Ran Zhang, Yucong Lin, Zhaoli Su, Bowen Liu, Danni Ai, Tianyu Fu, Deqiang Xiao, Jingfan Fan, Yuanyuan Wang, Mingwei Gao, Yuwan Hu, Shuya Gao, Jingtao Li, Jian Yang, Hong Song, Hongliang Sun

arXiv: 2603.22935v1 Published: 2026-03-24 Updated: 2026-03-24


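AI Summary

Proposes Ran Score, an LLM-based evaluation metric for radiology report generation, with particular attention to low-prevalence abnormalities and clinically important language such as negation and ambiguity.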

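Main Contributions

Proposes the Ran Score evaluation metric
Combines human expert knowledge with LLMs for multi-label finding extraction
Optimizes prompts to improve agreement with a radiologist-derived reference standard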

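Methodology

Uses a clinician-guided framework that combines LLMs with human-annotated data, optimizes prompts against the resulting reference labels, and evaluates the performance of report generation models.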

Original Abstract

Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels and evaluate report generation models. The optimized framework improves the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and shows robust generalization on the ChestX-CN validation cohort. Here we show that clinician-guided prompt optimization improves agreement with a radiologist-derived reference standard and that Ran Score enables finding-level evaluation of report fidelity, particularly for low-prevalence abnormalities.
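To make "macro-averaged" and "finding-level" concrete, here is a minimal sketch of a macro-averaged per-finding F1 between reference labels and labels extracted from generated reports. The abstract does not say which per-finding score is averaged or how Ran Score itself is computed, so the use of F1, the function names, and the toy labels below are all assumptions for illustration, not the authors' method.

```python
from typing import List, Set


def per_finding_f1(reference: List[Set[str]], predicted: List[Set[str]], finding: str) -> float:
    """F1 for one finding across all reports (assumed per-finding score)."""
    tp = fp = fn = 0
    for ref, pred in zip(reference, predicted):
        in_ref, in_pred = finding in ref, finding in pred
        if in_ref and in_pred:
            tp += 1
        elif in_pred:
            fp += 1
        elif in_ref:
            fn += 1
    denom = 2 * tp + fp + fn
    # With no positives in either set the score is undefined; report 0.0 here.
    return 2 * tp / denom if denom else 0.0


def macro_f1(reference: List[Set[str]], predicted: List[Set[str]], findings: List[str]) -> float:
    """Unweighted mean of per-finding F1, so rare findings count as much as common ones."""
    if not findings:
        raise ValueError("findings must not be empty")
    scores = [per_finding_f1(reference, predicted, f) for f in findings]
    return sum(scores) / len(scores)


# Hypothetical toy data: one label set per report.
findings = ["pneumothorax", "pleural effusion", "cardiomegaly"]
reference = [{"pleural effusion"}, {"pleural effusion", "cardiomegaly"}, {"pneumothorax"}, set()]
predicted = [{"pleural effusion"}, {"pleural effusion"}, set(), set()]

print(round(macro_f1(reference, predicted, findings), 3))  # 0.333
```

In this toy case the common finding is extracted perfectly, yet missing the two rarer findings pulls the macro average down to one third. That is why a macro-averaged, finding-level score is sensitive to low-prevalence abnormalities.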

arXiv Categories

cs.AI cs.HC
