Multimodal Learning Relevance: 9/10

Structure Observation Driven Image-Text Contrastive Learning for Computed Tomography Report Generation

Hong Liu, Dong Wei, Qiong Peng, Yawen Huang, Xian Wu, Yefeng Zheng, Liansheng Wang
arXiv: 2603.04878v1 Published: 2026-03-05 Updated: 2026-03-05

AI Summary

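Proposes a two-stage, structure-driven image-text contrastive learning framework for automated CT report generation, with the goal of improving clinical efficiency.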

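Key Contributions

  • Introduces structure-aware image-text contrastive learning
  • Proposes soft pseudo-labels based on text-text similarity to mitigate false negatives
  • Maintains a dynamic, diversity-enhanced queue of negative samples
  • Freezes the visual structure queries and uses them to select key image patches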
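
The soft pseudo-label idea in the second contribution can be illustrated with a small sketch. This is not the authors' implementation: the mixing weight `alpha`, the `temperature`, and the choice to blend a one-hot target with a softmax over text-text similarities are all assumptions made for illustration, and the function names are hypothetical.

```python
import math

def log_softmax(scores, temperature):
    # Numerically stable log-softmax over temperature-scaled scores.
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]

def soft_targets(text_sim_row, positive_index, alpha=0.4, temperature=0.1):
    # Assumed scheme: blend the one-hot target for the paired report with a
    # softmax over text-text similarities, so reports that say the same
    # thing as the paired one receive part of the target mass.
    soft = [math.exp(lp) for lp in log_softmax(text_sim_row, temperature)]
    return [
        (1.0 - alpha) * (1.0 if j == positive_index else 0.0) + alpha * s
        for j, s in enumerate(soft)
    ]

def soft_contrastive_loss(image_text_sim, text_text_sim, alpha=0.4, temperature=0.1):
    # Cross-entropy between soft targets and image-to-text probabilities,
    # averaged over rows. Row i is assumed to be paired with column i.
    total = 0.0
    for i, row in enumerate(image_text_sim):
        log_probs = log_softmax(row, temperature)
        targets = soft_targets(text_text_sim[i], i, alpha, temperature)
        total -= sum(t * lp for t, lp in zip(targets, log_probs))
    return total / len(image_text_sim)
```

With `alpha=0.0` this reduces to an ordinary one-hot contrastive loss. With `alpha > 0`, a report whose text is nearly identical to the paired one is no longer treated as a pure negative, so the loss stops rewarding the model for pushing the image away from it.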

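Methodology

Two-stage framework: the structure-learning stage uses contrastive learning to align structure-level semantics between images and text; the report-generation stage freezes the visual queries and generates the report with a text decoder.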
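
As a rough illustration of the second stage, the sketch below uses frozen per-structure queries to pick the image patches most relevant to each anatomical structure. The summary does not state the selection rule, so the dot-product scoring, the top-`k` cutoff, and the names `select_patches` and `structure_queries` are assumptions.

```python
def dot(a, b):
    # Plain dot product between two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def select_patches(structure_queries, patch_embeddings, k=2):
    # For each structure, score every patch against that structure's frozen
    # query and keep the indices of the k highest-scoring patches.
    selected = {}
    for name, query in structure_queries.items():
        scores = [dot(query, patch) for patch in patch_embeddings]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        selected[name] = sorted(ranked[:k])  # ascending patch index
    return selected

# Toy example: two structures, five 2-D patch embeddings.
queries = {"lung": [1.0, 0.0], "heart": [0.0, 1.0]}
patches = [[0.9, 0.1], [0.1, 0.8], [0.7, 0.2], [0.0, 0.1], [0.2, 0.9]]
chosen = select_patches(queries, patches, k=2)
```

Only the selected patch embeddings would then be passed to the text decoder, which is how a selection step of this kind can cut both distraction from irrelevant regions and memory use.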

Original Abstract

Computed Tomography Report Generation (CTRG) aims to automate the clinical radiology reporting process, thereby reducing the workload of report writing and facilitating patient care. While deep learning approaches have achieved remarkable advances in X-ray report generation, their effectiveness may be limited in CTRG due to larger data volumes of CT images and more intricate details required to describe them. This work introduces a novel two-stage (structure- and report-learning) framework tailored for CTRG featuring effective structure-wise image-text contrasting. In the first stage, a set of learnable structure-specific visual queries observe corresponding structures in a CT image. The resulting observation tokens are contrasted with structure-specific textual features extracted from the accompanying radiology report with a structure-wise image-text contrastive loss. In addition, text-text similarity-based soft pseudo targets are proposed to mitigate the impact of false negatives, i.e., semantically identical image structures and texts from non-paired images and reports. Thus, the model learns structure-level semantic correspondences between CT images and reports. Further, a dynamic, diversity-enhanced negative queue is proposed to guide the network in learning to discriminate various abnormalities. In the second stage, the visual structure queries are frozen and used to select the critical image patch embeddings depicting each anatomical structure, minimizing distractions from irrelevant areas while reducing memory consumption. Also, a text decoder is added and trained for report generation. Our extensive experiments on two public datasets demonstrate that our framework establishes new state-of-the-art performance for CTRG in clinical efficiency, and its components are effective.
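
The abstract does not describe how the dynamic, diversity-enhanced negative queue is maintained. One possible reading is sketched below, purely as an assumption: a fixed-size first-in-first-out queue that refuses a new negative when it is a near-duplicate (by cosine similarity) of one already stored. The class name, `max_size`, and `max_similarity` are hypothetical.

```python
import math
from collections import deque

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

class DiverseNegativeQueue:
    def __init__(self, max_size=4, max_similarity=0.95):
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self.items = deque(maxlen=max_size)
        self.max_similarity = max_similarity

    def push(self, feature):
        # Skip features that nearly duplicate a stored negative, so the
        # queue keeps covering varied findings. Returns True if stored.
        if any(cosine(feature, kept) > self.max_similarity for kept in self.items):
            return False
        self.items.append(list(feature))
        return True

queue = DiverseNegativeQueue(max_size=2, max_similarity=0.95)
results = [queue.push(f) for f in ([1, 0], [0.99, 0.01], [0, 1], [-1, 0])]
```

In this toy run the second feature is rejected as a near-duplicate of the first, and the fourth push evicts the oldest entry because the queue holds at most two negatives.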

Tags

CT report generation, image-text contrastive learning, structure-aware, medical imaging

arXiv Categories

cs.CV