Multimodal Learning Relevance: 9/10

No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Hai X. Pham, David T. Hoffmann, Ricardo Guerrero, Brais Martinez
arXiv: 2603.25722v1 Published: 2026-03-26 Updated: 2026-03-26

AI Summary

Proposes a concept-centric learning method that improves contrastive vision-language models on compositionality tasks while preserving their zero-shot capabilities.

Key Contributions

  • Proposes a concept-centric learning framework that addresses the compositionality limitations of vision-language models.
  • Aligns images with short, concept-centric parts of their captions.
  • Introduces a parameter-free cross-modal attention pooling to obtain concept-centric visual embeddings.
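The parameter-free pooling can be sketched as standard attention where a concept's text embedding serves as the query and the image patch embeddings serve as keys and values; no learned projection is involved. This is an illustrative reading of the abstract, not the paper's exact formulation, and all names below are hypothetical:

```python
import numpy as np

def cross_modal_attention_pool(patch_embeds: np.ndarray,
                               concept_embed: np.ndarray) -> np.ndarray:
    """Parameter-free cross-modal attention pooling (sketch).

    patch_embeds : (n_patches, d) image-patch embeddings from the vision encoder
    concept_embed: (d,) text embedding of one concept-centric caption part
    Returns a (d,) concept-centric visual embedding: a softmax-weighted
    average of the patches, weighted by similarity to the concept.
    """
    d = patch_embeds.shape[-1]
    scores = patch_embeds @ concept_embed / np.sqrt(d)  # (n_patches,)
    scores -= scores.max()                              # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()     # softmax over patches
    return weights @ patch_embeds                       # (d,)

rng = np.random.default_rng(0)
patches = rng.standard_normal((49, 64))  # e.g. a 7x7 ViT patch grid, dim 64
concept = rng.standard_normal(64)        # embedding of one caption part
pooled = cross_modal_attention_pool(patches, concept)
```

Because the pooling has no parameters, it adds nothing at inference time, consistent with the abstract's claim of no increased inference cost.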

Methodology

Short concept-centric caption parts are extracted with standard NLP software and aligned with the image; a parameter-free cross-modal attention pooling then produces concept-centric visual embeddings, and the model is trained with simple auxiliary contrastive losses.
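The caption-splitting step relies on standard NLP tooling (e.g. noun-phrase chunking), for which the paper gives no code here. As a self-contained, deliberately crude stand-in, a long caption can be split on connectives to yield short concept-centric parts:

```python
import re

def split_caption(caption: str) -> list[str]:
    """Crude stand-in for NLP-based extraction of short, concept-centric
    caption parts: split on commas and the connectives 'and'/'with'.
    Real noun-phrase chunking (e.g. via a dependency parser) yields
    cleaner parts; this only illustrates the idea.
    """
    parts = re.split(r",|\band\b|\bwith\b", caption)
    return [p.strip() for p in parts if p.strip()]

split_caption("a brown dog with a red collar and a blue frisbee on green grass")
# → ['a brown dog', 'a red collar', 'a blue frisbee on green grass']
```

Each resulting part names roughly one concept, so a separate image-text contrastive term per part can encourage the binding that a single long caption does not require.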

Original Abstract

Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders lead to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/SamsungLabs/concept_centric_clip.

Tags

Vision-Language Models · Compositionality · Contrastive Learning · Concept-Centric Learning

arXiv Categories

cs.CV cs.LG