Multimodal Learning Relevance: 9/10

Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness

Xin Hu, Haomiao Ni, Yunbei Zhang, Jihun Hamm, Zechen Li, Zhengming Ding

arXiv: 2602.19615v1 Published: 2026-02-23 Updated: 2026-02-23

AI Summary

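To address the weakness of vision language models in reasoning about rare objects, the paper proposes an efficient plug-and-play module that improves rare object recognition and reasoning without finetuning the model.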

Key Contributions

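Proposes learning multi-modal class embeddings that draw on vision foundation models and text descriptions to compensate for the scarcity of training data on rare objects.
Designs an attention-based enhancement module that refines visual tokens and improves the model's perception of fine-grained details.
Uses the learned embeddings as object-aware detectors to generate hints that guide the model's attention toward relevant regions.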
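The third contribution, using class embeddings as detectors that produce text hints, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the cosine-similarity matching rule, the `threshold` value, the hint wording, and the function names `build_hint` and `augment_prompt` are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_hint(visual_tokens, class_embeddings, threshold=0.8):
    """Name every class whose embedding closely matches some image patch.

    visual_tokens: list of patch vectors (assumed layout).
    class_embeddings: dict mapping class name -> embedding vector.
    threshold: illustrative cutoff, not a value from the paper.
    """
    if not visual_tokens:
        return ""
    detected = []
    for name, emb in class_embeddings.items():
        scores = [cosine(tok, emb) for tok in visual_tokens]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            detected.append((name, best))
    if not detected:
        return ""
    parts = [f"{name} (near patch {idx})" for name, idx in detected]
    return "Hint: the image may contain " + ", ".join(parts) + "."

def augment_prompt(question, hint):
    """Prepend the hint to the user question when a hint exists."""
    return f"{hint} {question}" if hint else question
```

The hint is plain text added to the prompt, so the vision language model itself needs no weight updates, which is consistent with the plug-and-play framing in the abstract.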

Methodology

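The method learns multi-modal class embeddings, then uses them both to refine visual tokens through an attention mechanism and to generate hints for the text prompt, improving the vision language model's reasoning about rare objects.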
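As a rough illustration of attention-based token refinement, the sketch below lets each visual token attend over a set of class embeddings and adds the attended context back as a residual. This is a simplified stand-in, not the paper's module: it has no learned projections, and the mixing weight `alpha` and the function name `refine_tokens` are assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def refine_tokens(visual_tokens, class_embeddings, alpha=0.5):
    """Residual cross-attention: tokens are queries, embeddings are keys and values.

    visual_tokens: list of patch vectors.
    class_embeddings: list of embedding vectors with the same dimension.
    alpha: illustrative residual weight, not a value from the paper.
    """
    dim = len(class_embeddings[0])
    scale = math.sqrt(dim)
    refined = []
    for tok in visual_tokens:
        # Scaled dot-product score of this token against each class embedding.
        logits = [sum(t * e for t, e in zip(tok, emb)) / scale
                  for emb in class_embeddings]
        weights = softmax(logits)
        # Attention-weighted average of the class embeddings.
        context = [sum(w * emb[d] for w, emb in zip(weights, class_embeddings))
                   for d in range(dim)]
        refined.append([t + alpha * c for t, c in zip(tok, context)])
    return refined
```

Because the output keeps the same number of tokens and the same dimension as the input, a module of this shape could sit between a frozen vision encoder and a frozen language model.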

Original Abstract

Vision language models (VLMs) have achieved remarkable success in broad visual understanding, yet they remain challenged by object-centric reasoning on rare objects due to the scarcity of such instances in pretraining data. While prior efforts alleviate this issue by retrieving additional data or introducing stronger vision encoders, these methods are still computationally intensive during finetuning VLMs and don't fully exploit the original training data. In this paper, we introduce an efficient plug-and-play module that substantially improves VLMs' reasoning over rare objects by refining visual tokens and enriching input text prompts, without VLMs finetuning. Specifically, we propose to learn multi-modal class embeddings for rare objects by leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions, compensating for limited training examples. These embeddings refine the visual tokens in VLMs through a lightweight attention-based enhancement module that improves fine-grained object details. In addition, we use the learned embeddings as object-aware detectors to generate informative hints, which are injected into the text prompts to help guide the VLM's attention toward relevant image regions. Experiments on two benchmarks show consistent and substantial gains for pretrained VLMs in rare object recognition and reasoning. Further analysis reveals how our method strengthens the VLM's ability to focus on and reason about rare objects.

arXiv Categories

cs.CV
