Adaptive Global and Fine-Grained Perceptual Fusion for MLLM Embeddings Compatible with Hard Negative Amplification
AI Summary
Proposes AGFF-Embed, an MLLM embedding model that adaptively fuses global and fine-grained information, combined with EGA to further improve performance.
Key Contributions
- Proposes the AGFF-Embed model, which fuses global and fine-grained perception
- Prompts the MLLM to generate embeddings for different semantic dimensions
- Integrates the EGA technique to achieve hard negative enhancement
- Achieves SOTA on the MMEB and MMVP-VLM benchmarks
Methodology
The MLLM is prompted to generate embeddings that capture different dimensions of semantic information; these are adaptively fused, and EGA is then applied for hard negative mining.
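The adaptive fusion step can be illustrated with a minimal sketch. The paper's exact gating mechanism is not specified here, so this assumes a common design: per-view gate scores (in the real model, presumably produced by a learned head) are passed through a softmax, and the global and fine-grained embeddings are combined as a weighted sum followed by L2 normalization. Function and variable names are illustrative only.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of gate scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def adaptive_fuse(embeddings, gate_scores):
    """Fuse several same-dimensional embeddings (e.g., one global view and
    one fine-grained view) into a single vector via softmax-gated averaging,
    then L2-normalize, as is standard for embedding models."""
    weights = softmax(gate_scores)
    dim = len(embeddings[0])
    fused = [sum(w * e[i] for w, e in zip(weights, embeddings))
             for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in fused)) or 1.0
    return [v / norm for v in fused]

# Example: two views with equal gate scores contribute equally.
global_emb = [1.0, 0.0]
fine_emb = [0.0, 1.0]
fused = adaptive_fuse([global_emb, fine_emb], [0.0, 0.0])
```

Because the weights come from a softmax, the aggregation is smooth: small changes in the gate scores shift the balance between global and fine-grained information gradually rather than switching hard between views.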
原文摘要
Multimodal embeddings serve as a bridge for aligning vision and language, with the two primary implementations -- CLIP-based and MLLM-based embedding models -- both limited to capturing only global semantic information. Although numerous studies have focused on fine-grained understanding, we observe that the complex scenarios currently targeted by MLLM embeddings often involve a hybrid perceptual pattern of both global and fine-grained elements, thus necessitating a compatible fusion mechanism. In this paper, we propose Adaptive Global and Fine-grained perceptual Fusion for MLLM Embeddings (AGFF-Embed), a method that prompts the MLLM to generate multiple embeddings focusing on different dimensions of semantic information, which are then adaptively and smoothly aggregated. Furthermore, we combine AGFF-Embed with the Explicit Gradient Amplification (EGA) technique to achieve in-batch hard-negative enhancement without requiring fine-grained editing of the dataset. Evaluation on the MMEB and MMVP-VLM benchmarks shows that AGFF-Embed achieves state-of-the-art performance in both general and fine-grained understanding compared to other multimodal embedding models.
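The in-batch hard-negative enhancement idea can be sketched as follows. The abstract does not spell out how EGA amplifies gradients, so this assumes one plausible realization: in an InfoNCE-style contrastive loss, the hardest in-batch negative (the non-positive with the highest similarity) gets its contribution to the denominator upweighted by a factor alpha, which enlarges its gradient without any edits to the dataset. All names and the specific weighting scheme are assumptions for illustration.

```python
import math

def contrastive_loss_with_amplification(sims, pos_idx, alpha=2.0, temp=0.05):
    """InfoNCE-style loss over one query's similarity row `sims`.
    The hardest in-batch negative's term in the denominator is scaled by
    `alpha` (> 1), amplifying its gradient; alpha=1 recovers plain InfoNCE."""
    logits = [s / temp for s in sims]
    # hardest negative = non-positive index with the highest similarity
    neg_indices = [i for i in range(len(sims)) if i != pos_idx]
    hardest = max(neg_indices, key=lambda i: sims[i])
    weights = [alpha if i == hardest else 1.0 for i in range(len(sims))]
    weights[pos_idx] = 1.0
    # numerically stable log-sum-exp with per-term weights
    m = max(logits)
    denom = sum(w * math.exp(l - m) for w, l in zip(weights, logits))
    return -(logits[pos_idx] - m - math.log(denom))

# Example: one positive (index 0), one hard negative (0.8), one easy one (0.1).
row = [0.9, 0.8, 0.1]
plain = contrastive_loss_with_amplification(row, 0, alpha=1.0)
amplified = contrastive_loss_with_amplification(row, 0, alpha=2.0)
```

With alpha > 1 the loss (and hence the gradient pushing the query away from the hard negative) is strictly larger than with the unweighted loss, which is the intended effect of amplification without curating extra negatives.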