Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification
AI Summary
Proposes a cross-modal prototype alignment and mixing method that improves CLIP's performance on few-shot image classification.
Key Contributions
- Proposes mixing image and text prototypes, which acts as a shrinkage estimator
- Proposes a text-aligned semantic image subspace to reduce noise in image prototypes
- Proposes combining a text-aligned mixed-prototype classifier with an image-specific LDA classifier
Methodology
Cross-modal alignment is achieved by projecting image prototypes into the text embedding space; the aligned image prototypes are then mixed with text prototypes for classification.
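The prototype-mixing step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `mixed_prototypes` and the fixed mixing weight `alpha` are illustrative, and embeddings are assumed to be CLIP-style unit-normalized vectors.

```python
import numpy as np

def mixed_prototypes(image_feats, text_embeds, alpha=0.5):
    """Mix per-class image prototypes with text prototypes (illustrative sketch).

    image_feats: dict mapping class index -> (n_shots, d) image embeddings
    text_embeds: (n_classes, d) text embeddings
    alpha: mixing weight; smaller alpha shrinks the image prototype
           further toward the text prototype
    """
    protos = []
    for c in range(text_embeds.shape[0]):
        img = image_feats[c].mean(axis=0)              # few-shot image prototype
        img = img / np.linalg.norm(img)                # unit-normalize, CLIP-style
        txt = text_embeds[c] / np.linalg.norm(text_embeds[c])
        mixed = alpha * img + (1.0 - alpha) * txt      # convex combination
        protos.append(mixed / np.linalg.norm(mixed))   # renormalize the mixture
    return np.stack(protos)
```

A query image embedding would then be classified by cosine similarity against the mixed prototypes; shrinking toward the text prototype trades a small bias for reduced variance when only a few shots per class are available.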
原文摘要
Vision-language models (VLMs) like CLIP are trained with the objective of aligning text and image pairs. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from the training set are an important source of information. In this work we investigate the impact of directly mixing image and text prototypes for few-shot classification and analyze this from a bias-variance perspective. We show that mixing prototypes acts like a shrinkage estimator. Although mixed prototypes improve classification performance, the image prototypes still add some noise in the form of instance-specific background or context information. In order to capture only information from the image space relevant to the given classification task, we propose projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned semantic image subspace. These text-aligned image prototypes, when mixed with text embeddings, further improve classification. However, for downstream datasets with poor cross-modal alignment in CLIP, semantic alignment might be suboptimal. We show that the image subspace can still be leveraged by modeling the anisotropy using class covariances. We demonstrate that combining a text-aligned mixed prototype classifier and an image-specific LDA classifier outperforms existing methods across few-shot classification benchmarks.
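The two remaining ingredients of the abstract can also be sketched: projecting image prototypes onto the principal directions of the text embedding space, and fusing a prototype classifier with an LDA-style classifier. This is a hedged sketch under assumed names (`text_aligned_protos`, `combined_logits`, subspace size `k`, fusion weight `beta`); the exact subspace construction, LDA covariance estimation, and score fusion in the paper may differ.

```python
import numpy as np

def text_aligned_protos(image_protos, text_embeds, k=8):
    """Project image prototypes onto the top-k principal directions
    of the (centered) text embedding space, then map back to the
    ambient space. k is an illustrative choice."""
    T = text_embeds - text_embeds.mean(axis=0)
    _, _, Vt = np.linalg.svd(T, full_matrices=False)
    P = Vt[:k]                      # (k, d) orthonormal semantic directions
    return image_protos @ P.T @ P   # projection into the semantic subspace

def combined_logits(query, mixed_protos, lda_means, shared_cov_inv, beta=0.5):
    """Fuse cosine-similarity prototype logits with LDA discriminant scores.
    Assumes a shared (pooled) class covariance, as in standard LDA."""
    q = query / np.linalg.norm(query)
    proto_logits = mixed_protos @ q
    lda_logits = np.array([
        m @ shared_cov_inv @ query - 0.5 * m @ shared_cov_inv @ m
        for m in lda_means
    ])
    return beta * proto_logits + (1.0 - beta) * lda_logits
```

The projection discards image-prototype directions orthogonal to the text space (instance-specific background or context), while the LDA branch models the anisotropy of the image features through class covariances when cross-modal alignment is poor.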