ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion
AI Summary
ITO improves modality consistency and representation quality in image-text contrastive learning through multiple alignment and training-time fusion.
Key Contributions
- Proposes the ITO framework, combining multiple alignment with training-time fusion
- Multiple alignment enriches supervision over image-text correspondences
- The training-time fusion module acts as a structural regularizer, eliminating the modality gap
Methodology
Uses multimodal multiple alignment together with a training-time multimodal fusion module to strengthen cross-modal interaction during training; the fusion module is removed at inference.
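The train-with-fusion, infer-without-fusion pattern can be sketched in NumPy. This is a minimal illustration under assumptions, not the paper's implementation: the encoder dimensions, the single-linear-layer fusion head (`W_fuse`), and the exact regularizer form (pulling a fused joint embedding toward both unimodal embeddings) are all hypothetical stand-ins for the "lightweight fusion module" the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical toy weights standing in for the image/text encoder towers
# and the training-time fusion head; all shapes are assumptions.
W_img = rng.normal(size=(512, 128)) * 0.05
W_txt = rng.normal(size=(300, 128)) * 0.05
W_fuse = rng.normal(size=(256, 128)) * 0.05  # used only during training

def encode(img_feats, txt_feats):
    """Standard dual-encoder projection into a shared embedding space."""
    return l2norm(img_feats @ W_img), l2norm(txt_feats @ W_txt)

def cross_entropy(logits, labels):
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def training_loss(img_feats, txt_feats, tau=0.07, lam=0.5):
    z_i, z_t = encode(img_feats, txt_feats)
    labels = np.arange(len(z_i))
    logits = z_i @ z_t.T / tau
    # Symmetric InfoNCE over matched image-text pairs in the batch.
    contrastive = 0.5 * (cross_entropy(logits, labels)
                         + cross_entropy(logits.T, labels))
    # Assumed fusion regularizer: a fused joint embedding of each matched
    # pair is pulled toward both unimodal embeddings, nudging the two
    # modalities toward a shared structure (closing the modality gap).
    fused = l2norm(np.concatenate([z_i, z_t], axis=1) @ W_fuse)
    fusion_reg = (np.square(fused - z_i).mean()
                  + np.square(fused - z_t).mean())
    return contrastive + lam * fusion_reg

def inference_similarity(img_feats, txt_feats):
    # The fusion head is discarded at inference: retrieval uses only the
    # dual-encoder cosine similarities, as in CLIP-style models.
    z_i, z_t = encode(img_feats, txt_feats)
    return z_i @ z_t.T

batch_imgs = rng.normal(size=(8, 512))
batch_txts = rng.normal(size=(8, 300))
loss = training_loss(batch_imgs, batch_txts)
sims = inference_similarity(batch_imgs, batch_txts)
```

Because `W_fuse` only enters `training_loss`, dropping it after training leaves inference cost identical to a plain dual-encoder, which is the efficiency property the abstract emphasizes.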
Original Abstract
Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.