Multimodal Learning Relevance: 9/10

Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

Tilemachos Aravanis, Vladan Stojnić, Bill Psomas, Nikos Komodakis, Giorgos Tolias
arXiv: 2602.23339v1 Published: 2026-02-26 Updated: 2026-02-26

AI Summary

Proposes a retrieval-augmented test-time adapter that uses a small number of pixel-annotated examples to improve open-vocabulary segmentation performance.

Key Contributions

  • Proposes a retrieval-augmented test-time adapter that fuses textual and visual support features
  • Performs learned, per-query feature fusion, achieving stronger synergy between modalities
  • Shows that a few-shot setting significantly narrows the gap between zero-shot and supervised segmentation

Methodology

Building on a few-shot setting, a retrieval-augmented test-time adapter learns a lightweight, per-image classifier by fusing textual and visual support features.
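The fusion idea can be illustrated with a minimal sketch. Here, the learned adapter is replaced by a simple similarity-based gate (an assumption for illustration, not the paper's actual architecture): for each query, per-class text and visual prototypes are mixed according to how strongly the query feature agrees with each modality, and the fused prototypes then act as a per-image classifier over pixel features.

```python
import numpy as np

def l2norm(x, axis=-1):
    """L2-normalize along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def fuse_classifier(text_protos, visual_protos, query_feat, temp=10.0):
    """Build a per-query classifier by fusing text and visual prototypes.

    A softmax over the query's similarity to each modality's prototype
    stands in for the learned per-query fusion described in the paper.
    text_protos, visual_protos: (C, D); query_feat: (D,).
    """
    t = l2norm(text_protos)
    v = l2norm(visual_protos)
    q = l2norm(query_feat)
    # Per-class, per-modality agreement with the query feature.
    sims = np.stack([t @ q, v @ q], axis=-1)      # (C, 2)
    w = np.exp(temp * sims)
    w = w / w.sum(axis=-1, keepdims=True)         # softmax over modalities
    fused = w[:, :1] * t + w[:, 1:] * v           # (C, D) fused classifier
    return l2norm(fused)

def segment(pixel_feats, classifier):
    """Assign each pixel feature to the highest-scoring fused prototype."""
    scores = l2norm(pixel_feats) @ classifier.T   # (N, C)
    return scores.argmax(axis=-1)
```

A larger or continually expanding support set only changes how `visual_protos` are computed (e.g., as means over retrieved support features per class); the fusion and classification steps are unchanged.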

Original Abstract

Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.

Tags

Open-Vocabulary Segmentation Vision-Language Models Few-Shot Learning Retrieval Augmentation

arXiv Category

cs.CV