Multimodal Learning Relevance: 9/10

Unlocking Few-Shot Capabilities in LVLMs via Prompt Conditioning and Head Selection

Adhemar de Senneville, Xavier Bou, Jérémy Anger, Rafael Grompone, Gabriele Facciolo
arXiv: 2603.24181v1 Published: 2026-03-25 Updated: 2026-03-25

AI Summary

LVLMs can improve their zero-shot and few-shot image classification performance through prompt conditioning and attention-head selection, narrowing the gap with CLIP-based methods.

Key Contributions

  • Proposes Head Ensemble Classifiers (HEC), a training-free classifier.
  • Shows that LVLMs' internal representations, especially attention heads, perform strongly on classification tasks.
  • Improves the class separability of LVLMs' visual features via prompt conditioning.

Methodology

Prompt conditioning is used to improve feature class separability; a Gaussian Discriminant Analysis-inspired criterion then selects the most discriminative vision and text attention heads, which are combined into the HEC classifier.
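The paper does not include code here, so the following is only a minimal sketch of the general idea: rank per-head features by a GDA-style between-class / within-class scatter ratio, then ensemble the top heads into a training-free nearest-prototype classifier. All names (`head_separability`, `build_hec`) and the nearest-prototype scoring are illustrative assumptions, not the paper's exact HEC formulation.

```python
import numpy as np

def head_separability(feats, labels):
    """GDA-inspired score for one attention head:
    between-class scatter divided by within-class scatter."""
    classes = np.unique(labels)
    mu = feats.mean(axis=0)
    between, within = 0.0, 0.0
    for c in classes:
        fc = feats[labels == c]
        mc = fc.mean(axis=0)
        between += len(fc) * np.sum((mc - mu) ** 2)
        within += np.sum((fc - mc) ** 2)
    return between / (within + 1e-8)

def build_hec(head_feats, labels, top_k=3):
    """head_feats: dict head_name -> (n_samples, dim) array of
    few-shot features. Returns a training-free ensemble classifier."""
    scores = {h: head_separability(f, labels) for h, f in head_feats.items()}
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    classes = np.unique(labels)
    # Class prototypes per selected head (training-free: just class means).
    protos = {h: {c: head_feats[h][labels == c].mean(axis=0)
                  for c in classes} for h in top}

    def classify(x_per_head):
        # x_per_head: dict head_name -> (dim,) feature for one test image.
        # Ensemble by summing negative distances to class prototypes.
        votes = np.zeros(len(classes))
        for h in top:
            for i, c in enumerate(classes):
                votes[i] -= np.linalg.norm(x_per_head[h] - protos[h][c])
        return classes[np.argmax(votes)]

    return classify
```

On synthetic data where one head separates the classes and another is pure noise, the ranking picks the discriminative head and the prototype ensemble classifies held-out points correctly.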

Original Abstract

Current Large Vision Language Models (LVLMs) excel at many zero-shot tasks like image captioning, visual question answering and OCR. However, these same models suffer from poor performance at image classification tasks, underperforming against CLIP-based methods. Notably, this gap is surprising because many LVLMs use CLIP-pretrained vision encoders. Yet LVLMs are not inherently limited by CLIP's architecture with independent vision and text encoders. In CLIP, this separation biases classification toward class-name matching rather than joint visual-text reasoning. In this paper we show that, despite their poor raw performance, LVLMs can improve visual feature class separability at inference using prompt conditioning, and LVLMs' internal representations, especially attention heads, can outperform the model itself at zero-shot and few-shot classification. We introduce Head Ensemble Classifiers (HEC) to bridge the performance gap between CLIP-based and LVLM-based classification methods. Inspired by Gaussian Discriminant Analysis, HEC ranks the most discriminative vision and text heads and combines them into a training-free classifier. We show that HEC achieves state-of-the-art performance in few-shot and zero-shot classification across 12 datasets.

Tags

LVLM Few-shot Learning Zero-shot Learning Image Classification Attention Heads

arXiv Categories

cs.CV