Multimodal Learning — Relevance: 9/10

Local-Global Prompt Learning via Sparse Optimal Transport

Deniz Kizaroğlu, Ülku Tuncer Küçüktas, Emre Çakmakyurdu, Alptekin Temizel
arXiv: 2603.08347v1 Published: 2026-03-09 Updated: 2026-03-09

AI Summary

SOT-GLP performs local-global prompt learning via sparse optimal transport, improving the few-shot classification and OOD detection performance of vision-language models.

Key Contributions

  • Proposes SOT-GLP, which combines global and local prompt learning
  • Constructs a class-conditioned sparse patch set using V-V attention
  • Allocates local regions via balanced entropic optimal transport, preventing overlap between prompts
  • Achieves state-of-the-art performance on few-shot classification and OOD detection
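The "sparse patch set via V-V attention" step can be illustrated with a minimal numpy sketch. This is an assumption-laden toy: single-head value-value attention (attention weights computed from the value features themselves rather than Q·K, as in CLIP-surgery-style work), followed by a top-k selection against a class text embedding. The names `vv_attention` and `sparse_patch_set` are illustrative, not taken from the paper's code.

```python
import numpy as np

def vv_attention(values):
    """Toy single-head V-V self-attention: weights come from the value
    features themselves, softmax(V V^T / sqrt(d)) @ V. In CLIP-surgery-style
    work this tends to give more localized patch maps than Q-K attention."""
    d = values.shape[-1]
    logits = values @ values.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ values

def sparse_patch_set(patch_feats, class_text_feat, top_k=4):
    """Class-conditioned sparse support: keep the top-k refined patches
    most similar (cosine) to a class text embedding."""
    refined = vv_attention(patch_feats)
    refined /= np.linalg.norm(refined, axis=1, keepdims=True)
    scores = refined @ class_text_feat
    idx = np.argsort(scores)[::-1][:top_k]
    return refined[idx], idx
```

The point of the sparse support is that all class-specific local prompts later draw from this one shared patch pool, which is what lets the OT step partition it without redundant per-prompt region selection.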

Methodology

SOT-GLP combines global image-text matching with alignment between sparse local patches and class-specific prompts, using optimal transport to partition salient regions among the prompts.

Original Abstract

Few-shot adaptation of vision-language models (VLMs) like CLIP typically relies on learning textual prompts matched to global image embeddings. Recent works extend this paradigm by incorporating local image-text alignment to capture fine-grained visual cues, yet these approaches often select local regions independently for each prompt, leading to redundant local feature usage and prompt overlap. We propose SOT-GLP, which introduces a shared sparse patch support and balanced optimal transport allocation to explicitly partition salient visual regions among class-specific local prompts while preserving global alignment. Our method learns shared global prompts and class-specific local prompts. The global branch maintains standard image-text matching for robust category-level alignment. The local branch constructs a class-conditioned sparse patch set using V-V attention and aligns it to multiple class-specific prompts via balanced entropic optimal transport, yielding a soft partition of patches that prevents prompt overlap and collapse. We evaluate our method on two complementary objectives: (i) few-shot classification accuracy on 11 standard benchmarks and (ii) out-of-distribution (OOD) detection. On the standard 11-dataset benchmark with 16-shot ViT-B/16, SOT-GLP achieves 85.1% average accuracy, outperforming prior prompt-learning methods. We identify a distinct accuracy-robustness trade-off in prompt learning: while learnable projections optimize in-distribution fit, they alter the foundational feature space. We demonstrate that a projection-free local alignment preserves the native geometry of the CLIP manifold, yielding state-of-the-art OOD detection performance (94.2% AUC) that surpasses fully adapted models. Implementation available at: https://github.com/Deniz2304988/SOT-GLP

Tags

Vision-Language Models · Prompt Learning · Optimal Transport · Few-Shot Learning · OOD Detection

arXiv Categories

cs.CV