Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
AI Summary
Proposes a retrieval-augmented test-time adapter that leverages a small number of pixel-annotated samples to improve open-vocabulary segmentation performance.
Key Contributions
- Proposes a retrieval-augmented test-time adapter that fuses textual and visual support features
- Performs learned, per-query feature fusion, achieving stronger synergy between modalities
- Demonstrates that a few-shot setting significantly narrows the gap between zero-shot and supervised segmentation
Methodology
Building on a few-shot setting, a retrieval-augmented test-time adapter learns a lightweight per-image classifier by fusing textual and visual support features.
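The core idea, fusing a class's text embedding with visual prototypes extracted from pixel-annotated support images into per-class classifier weights, can be sketched as follows. This is a conceptual illustration only: a fixed mixing coefficient `alpha` stands in for the paper's learned, per-query fusion, and all feature shapes and the cosine-similarity classifier are assumptions, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def fuse_classifier(text_emb, support_protos, alpha):
    """Build per-class classifier weights as a convex combination of
    text embeddings and visual support prototypes.
    NOTE: a fixed `alpha` is a stand-in for the paper's learned,
    per-query fusion."""
    fused = alpha * l2_normalize(text_emb) + (1 - alpha) * l2_normalize(support_protos)
    return l2_normalize(fused)

def segment(pixel_feats, class_weights):
    """Assign each pixel feature to the class with highest cosine similarity."""
    logits = l2_normalize(pixel_feats) @ class_weights.T
    return logits.argmax(axis=-1)

# Toy example: 3 classes, 16-dim features (hypothetical sizes).
rng = np.random.default_rng(0)
D, C = 16, 3
text_emb = rng.normal(size=(C, D))                           # one text embedding per class
support_protos = text_emb + 0.1 * rng.normal(size=(C, D))    # visual prototypes from support masks
weights = fuse_classifier(text_emb, support_protos, alpha=0.5)

# Query pixels drawn near the prototypes of classes 0, 1, 2, 0.
pixels = support_protos[[0, 1, 2, 0]] + 0.01 * rng.normal(size=(4, D))
print(segment(pixels, weights))
```

Because the support set only contributes prototype vectors, it can be expanded continually without retraining the backbone, which is what enables the personalized-segmentation use case mentioned in the abstract.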
Original Abstract
Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.