Multimodal Learning Relevance: 9/10

Large Multimodal Models as General In-Context Classifiers

Marco Garosi, Matteo Farina, Alessandro Conti, Massimiliano Mancini, Elisa Ricci
arXiv: 2602.23229v1 Published: 2026-02-26 Updated: 2026-02-26

AI Summary

The paper investigates the in-context classification capabilities of large multimodal models (LMMs) and proposes CIRCLE, a method that improves open-world classification.

Key Contributions

  • Demonstrates the potential of LMMs as classifiers via in-context learning.
  • Proposes CIRCLE, improving the robustness of LMMs in open-world classification.
  • Shows experimentally that LMMs can serve as unified classifiers, replacing specialized models.

Methodology

Benchmarks the classification performance of LMMs against CLIP-like contrastive VLMs, and proposes CIRCLE, a training-free method that assigns pseudo-labels to in-context examples and iteratively refines them.
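The iterative pseudo-label refinement idea can be illustrated with a minimal sketch. This is not the paper's implementation: the `predict` function below is a toy nearest-neighbor stand-in for querying an LMM with labeled context, and all names are assumptions. Each unlabeled example is re-labeled using the current pseudo-labels of the other examples as context, repeating until the labels stabilize.

```python
def predict(x, context):
    """Toy stand-in for an LMM query: return the label of the
    nearest labeled context example (1-D features for simplicity)."""
    best_label, best_dist = None, float("inf")
    for cx, clabel in context:
        d = abs(cx - x)
        if d < best_dist:
            best_dist, best_label = d, clabel
    return best_label

def circle_refine(examples, seed_labels, max_iters=10):
    """Iteratively refine pseudo-labels using the context itself,
    in the spirit of CIRCLE (illustrative sketch, not the paper's code)."""
    labels = list(seed_labels)
    for _ in range(max_iters):
        updated = []
        for i, x in enumerate(examples):
            # Context = all other examples with their current pseudo-labels.
            context = [(examples[j], labels[j])
                       for j in range(len(examples)) if j != i]
            updated.append(predict(x, context))
        if updated == labels:  # converged: no label changed
            break
        labels = updated
    return labels
```

With a noisy seed labeling, the refinement loop corrects the mislabeled example from its neighbors' context, e.g. `circle_refine([0.0, 0.1, 0.2, 5.0, 5.1, 5.2], ["a", "a", "a", "b", "b", "a"])` converges to `["a", "a", "a", "b", "b", "b"]`.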

Original Abstract

Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMMs) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers and a flexible alternative to specialized models.

Tags

LMM Multimodal Learning In-Context Learning Open-World Classification

arXiv Categories

cs.CV