Multimodal Learning Relevance: 9/10

Semi-Supervised Few-Shot Adaptation of Vision-Language Models

Julio Silva-Rodríguez, Ender Konukoglu
arXiv: 2603.02959v1 Published: 2026-03-03 Updated: 2026-03-03

AI Summary

To address the class-imbalance problem in few-shot medical image classification, this work proposes a semi-supervised method that exploits unlabeled data to improve model performance.

Key Contributions

  • Proposes a semi-supervised learning method based on propagating text-informed pseudo-labels
  • Applies the method to few-shot medical image classification
  • Lowers annotation cost, reducing labeling effort by >50% in low-shot regimes

Methodology

The method leverages pre-trained VLM embeddings to generate pseudo-labels from text information, then uses these pseudo-labels on unlabeled data to improve model performance during few-shot adaptation.
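The pipeline above can be sketched as follows. This is a minimal illustration, not the paper's actual solver: it assumes CLIP-style L2-normalized embeddings, a hypothetical confidence threshold for keeping pseudo-labels, and a simple class-prototype classifier as a stand-in for the multi-modal linear probe.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def text_informed_pseudo_labels(img_emb, txt_emb, threshold=0.4):
    """Assign each unlabeled image the class of its most similar text
    embedding; keep only confident assignments (hypothetical threshold
    on the softmax probability)."""
    sims = l2_normalize(img_emb) @ l2_normalize(txt_emb).T
    logits = 100.0 * sims                      # CLIP-style temperature
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    labels = probs.argmax(axis=1)
    keep = probs.max(axis=1) >= threshold
    return labels, keep

def fit_linear_probe(emb, labels, n_classes):
    """Class-mean (prototype) classifier: a simple stand-in for the
    paper's multi-modal linear probe."""
    W = np.stack([emb[labels == c].mean(axis=0) for c in range(n_classes)])
    return l2_normalize(W)

# Toy usage with random embeddings as stand-ins for VLM features.
rng = np.random.default_rng(0)
txt = rng.normal(size=(3, 8))            # 3 class text prompts
shots = rng.normal(size=(6, 8))          # 6 labeled few-shot examples
shot_y = np.array([0, 0, 1, 1, 2, 2])
unlab = rng.normal(size=(20, 8))         # unlabeled pool
py, keep = text_informed_pseudo_labels(unlab, txt)
X = np.concatenate([shots, unlab[keep]])
y = np.concatenate([shot_y, py[keep]])
W = fit_linear_probe(X, y, n_classes=3)  # (3, 8) classifier weights
```

The key idea this sketch captures is that the labeled shots are augmented with confidently pseudo-labeled unlabeled samples before fitting the probe, which is what allows underrepresented classes to gain extra training signal.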

Original Abstract

Vision-language models (VLMs) pre-trained on large, heterogeneous data sources are becoming increasingly popular, providing rich multi-modal embeddings that enable efficient transfer to new tasks. A particularly relevant application is few-shot adaptation, where only a handful of annotated examples are available to adapt the model through multi-modal linear probes. In medical imaging, specialized VLMs have shown promising performance in zero- and few-shot image classification, which is valuable for mitigating the high cost of expert annotations. However, challenges remain in extremely low-shot regimes: the inherent class imbalances in medical tasks often lead to underrepresented categories, penalizing overall model performance. To address this limitation, we propose leveraging unlabeled data by introducing an efficient semi-supervised solver that propagates text-informed pseudo-labels during few-shot adaptation. The proposed method enables lower-budget annotation pipelines for adapting VLMs, reducing labeling effort by >50% in low-shot regimes.

Tags

Vision-Language Models · Semi-Supervised Learning · Few-Shot Learning · Medical Imaging

arXiv Category

cs.CV