Multimodal Learning Relevance: 8/10

Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation

ByeongCheol Lee, Hyun Seok Seong, Sangeek Hyun, Gilhan Park, WonJun Moon, Jae-Pil Heo
arXiv: 2603.23030v1 Published: 2026-03-24 Updated: 2026-03-24

AI Summary

For training-free open-vocabulary semantic segmentation, this paper proposes a Global-Local Aligned CLIP model that resolves the semantic discrepancy between sliding windows.

Key Contributions

  • Proposes the Global-Local Aligned CLIP (GLA-CLIP) framework, enabling information exchange across windows
  • Introduces a proxy anchor that provides a unified semantic reference, mitigating window bias
  • Proposes a dynamic normalization scheme that adjusts attention strength according to object scale, improving segmentation of small objects

Methodology

Global context is introduced by extending the key-value tokens across all windows; a proxy anchor and a dynamic normalization scheme then address window bias and small-object scenarios, respectively.
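The cross-window attention with a proxy anchor can be sketched as follows. This is a minimal illustration under assumed shapes and a hypothetical `global_local_attention` helper, not the authors' released implementation: inner-window queries attend over key-value tokens gathered from all windows, and each query's proxy anchor (the aggregate of its most similar global tokens) is used to score both inner- and outer-window patches on equal footing.

```python
import torch
import torch.nn.functional as F

def global_local_attention(q_inner, kv_all_windows, top_k=16):
    """Hedged sketch of GLA-CLIP-style attention (hypothetical API).

    q_inner:        (N_in, d)  query tokens of the current window
    kv_all_windows: (N_all, d) key/value tokens pooled from every window
    """
    d = q_inner.shape[-1]

    # Cosine similarity between each inner-window query and every
    # key-value token collected from all windows.
    sim = F.normalize(q_inner, dim=-1) @ F.normalize(kv_all_windows, dim=-1).T

    # Proxy anchor: aggregate the top-k globally most similar tokens per
    # query, giving a unified semantic reference beyond the local window.
    topk_idx = sim.topk(top_k, dim=-1).indices          # (N_in, k)
    anchors = kv_all_windows[topk_idx].mean(dim=1)      # (N_in, d)

    # Measure similarity via the anchor rather than the raw query, so
    # outer-window tokens are not under-attended (window bias).
    attn = F.softmax(
        F.normalize(anchors, dim=-1) @ F.normalize(kv_all_windows, dim=-1).T
        / d ** 0.5,
        dim=-1,
    )
    return attn @ kv_all_windows                        # (N_in, d)
```

The top-k aggregation here stands in for the paper's "tokens highly similar to the given query"; the exact aggregation rule and scaling are assumptions.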

Original Abstract

A sliding-window inference strategy is commonly adopted in recent training-free open-vocabulary semantic segmentation methods to overcome the limitation of CLIP in processing high-resolution images. However, this approach introduces a new challenge: each window is processed independently, leading to semantic discrepancy across windows. To address this issue, we propose Global-Local Aligned CLIP (GLA-CLIP), a framework that facilitates comprehensive information exchange across windows. Rather than limiting attention to tokens within individual windows, GLA-CLIP extends key-value tokens to incorporate contextual cues from all windows. Nevertheless, we observe a window bias: outer-window tokens are less likely to be attended, since query features are produced through interactions within the inner-window patches and thereby lack semantic grounding beyond their local context. To mitigate this, we introduce a proxy anchor, constructed by aggregating tokens highly similar to the given query from all windows, which provides a unified semantic reference for measuring similarity across both inner- and outer-window patches. Furthermore, we propose a dynamic normalization scheme that adjusts attention strength according to object scale by dynamically scaling and thresholding the attention map to cope with small-object scenarios. Moreover, GLA-CLIP can be integrated into existing methods to broaden their receptive field. Extensive experiments validate the effectiveness of GLA-CLIP in enhancing training-free open-vocabulary semantic segmentation performance. Code is available at https://github.com/2btlFe/GLA-CLIP.
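The dynamic normalization idea from the abstract (scaling and thresholding the attention map by object scale) might look roughly like the sketch below. The function name, the min-max scaling, and the threshold `tau` are all illustrative assumptions; the paper's exact formulation may differ.

```python
import torch

def dynamic_normalize(attn, tau=0.5):
    """Hypothetical sketch of dynamic attention normalization.

    attn: (N, M) raw attention map, one row per query token.
    Small objects cover few tokens, so their raw rows are flat; rescaling
    each row to [0, 1] and thresholding keeps their peaks from washing out.
    """
    # Min-max scale each row so the strongest response is 1 regardless
    # of how many tokens the object occupies.
    row_min = attn.min(dim=-1, keepdim=True).values
    row_max = attn.max(dim=-1, keepdim=True).values
    scaled = (attn - row_min) / (row_max - row_min + 1e-6)

    # Threshold weak responses, then renormalize the surviving weights
    # into a distribution over the kept tokens.
    kept = torch.where(scaled > tau, scaled, torch.zeros_like(scaled))
    return kept / (kept.sum(dim=-1, keepdim=True) + 1e-6)
```

Because each row is rescaled so its maximum is 1, at least one token always survives the threshold, so the renormalization is well defined for any `tau < 1`.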

Tags

Semantic Segmentation · CLIP · Open Vocabulary · Zero-shot Learning · Computer Vision

arXiv Categories

cs.CV cs.AI