Multimodal Learning Relevance: 9/10

ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

HanZpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zonglin Zhao, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He
arXiv: 2603.02767v1 Published: 2026-03-03 Updated: 2026-03-03

AI Summary

ITO improves modality consistency and representational power in image-text contrastive learning via multiple alignment and training-time fusion.

Key Contributions

  • Proposes the ITO framework, combining multiple alignment with training-time fusion
  • Multiple alignment enriches supervision of image-text correspondences
  • The training-time fusion module acts as a structural regularizer that eliminates the modality gap

Methodology

Uses multimodal multiple alignment and a training-time multimodal fusion module to strengthen cross-modal interaction during training; the fusion module is removed at inference.
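The train-then-discard idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the encoder outputs, the temperature, the fusion MLP's form, and the loss weighting are all assumptions; only the overall pattern (contrastive alignment plus a fusion head used solely during training, with plain dual-encoder similarity at inference) follows the described method.

```python
import numpy as np

rng = np.random.default_rng(0)
D, B = 8, 4  # embedding dim, batch size (toy values)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for dual-encoder outputs (in practice: vision / text encoders).
img = l2norm(rng.normal(size=(B, D)))
txt = l2norm(rng.normal(size=(B, D)))

# --- Training: contrastive alignment over the batch (matched pairs on the diagonal) ---
tau = 0.07  # temperature (assumed value)
logits = img @ txt.T / tau
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
contrastive_loss = -log_probs[np.arange(B), np.arange(B)].mean()

# --- Training-time fusion: a tiny head over concatenated pairs acts as a
# structural regularizer (hypothetical form; the paper's module may differ).
W = rng.normal(scale=0.1, size=(2 * D, 1))
fused = (np.concatenate([img, txt], axis=1) @ W).squeeze(-1)
fusion_loss = np.mean((fused - 1.0) ** 2)  # push matched pairs toward a target score

total_loss = contrastive_loss + 0.1 * fusion_loss

# --- Inference: fusion head discarded; standard dual-encoder retrieval ---
retrieval_scores = img @ txt.T          # cosine similarity only
best_text = retrieval_scores.argmax(axis=1)
```

Because `W` never participates at inference, retrieval keeps the efficiency of a standard dual-encoder: scores are a single matrix product over precomputed embeddings.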

Original Abstract

Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.

Tags

Image-Text Contrastive Learning  Multimodal Learning  Visual Representation Learning

arXiv Categories

cs.CV cs.AI