Multimodal Learning (Relevance: 9/10)

ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport

Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham
arXiv: 2602.22678v1 Published: 2026-02-26 Updated: 2026-02-26

AI Summary

ViCLIP-OT is the first foundation model for Vietnamese image-text retrieval, combining contrastive learning with an optimal-transport loss.

Main Contributions

  • Proposes ViCLIP-OT, a model designed specifically for Vietnamese image-text retrieval
  • Integrates CLIP-style contrastive learning with a SIGROT loss to strengthen cross-modal consistency
  • Significantly outperforms CLIP and SigLIP on Vietnamese benchmark datasets, especially in the zero-shot setting

Methodology

Combines CLIP contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to optimize the alignment of image and text representations in the shared embedding space.
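The paper does not publish SIGROT's exact formulation in this summary, so the following is only a minimal sketch of the general recipe it describes: a symmetric CLIP-style contrastive (InfoNCE) loss plus an entropic optimal-transport alignment term computed with plain Sinkhorn iterations as a stand-in for the similarity-graph regularized variant. The function names, the weighting factor `lam`, and the choice of cosine distance as transport cost are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def info_nce(sim, tau=0.07):
    """Symmetric CLIP-style contrastive loss over an image-text similarity matrix."""
    logits = sim / tau
    def xent(l):
        # numerically stable log-softmax over rows, then pick matched diagonal pairs
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))
    # average image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

def sinkhorn_plan(cost, eps=0.1, n_iter=50):
    """Entropic-regularized OT plan between uniform marginals (Sinkhorn iterations)."""
    n, m = cost.shape
    K = np.exp(-cost / eps)
    a, b = np.ones(n) / n, np.ones(m) / m
    v = np.ones(m) / m
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def contrastive_ot_loss(img, txt, lam=0.5):
    """Sketch of a combined contrastive + OT alignment loss (lam is assumed)."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = img @ txt.T
    cost = 1.0 - sim                 # cosine distance as transport cost
    plan = sinkhorn_plan(cost)
    ot_term = np.sum(plan * cost)    # expected transport cost under the plan
    return info_nce(sim) + lam * ot_term
```

As a sanity check, a batch where image and text embeddings match row-for-row should score a lower loss than the same batch with captions shuffled, since the diagonal of the similarity matrix then holds the matched pairs.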

Original Abstract

Image-text retrieval has become a fundamental component in intelligent multimedia systems; however, most existing vision-language models are optimized for high-resource languages and remain suboptimal for low-resource settings such as Vietnamese. This work introduces ViCLIP-OT, a foundation vision-language model specifically designed for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and mitigate modality gap issues. Extensive experiments on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600) demonstrate that ViCLIP-OT consistently outperforms CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, the model achieves an average Recall@K of 67.34%, improving upon CLIP by 5.75 percentage points. In zero-shot evaluation on Crossmodal-3600, ViCLIP-OT surpasses CLIP by 11.72 percentage points. Embedding-space analysis further confirms improved alignment and reduced modality gap. The results indicate that integrating SIGROT provides an effective and scalable strategy for cross-modal retrieval in low-resource languages, offering practical implications for intelligent multimedia retrieval systems in Vietnamese and other underrepresented linguistic contexts.

Tags

vision-language model, image-text retrieval, Vietnamese, optimal transport, low-resource language

arXiv Categories

cs.CV cs.AI