Multimodal Learning Relevance: 9/10

LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation

Junyang Chen, Xiangbo Lv, Zhiqiang Kou, Xingdong Sheng, Ning Xu, Yiguo Qiao
arXiv: 2602.05578v1 Published: 2026-02-05 Updated: 2026-02-05

AI Summary

LoGoSeg fuses local and global features to achieve efficient open-vocabulary semantic segmentation with strong generalization.

Key Contributions

  • Proposes an object existence prior to reduce hallucination
  • Introduces a region-aware alignment module to establish region-level visual-textual correspondence
  • Proposes a dual-stream fusion mechanism that combines local structural information with global semantic context
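The first contribution, the object existence prior, weights candidate categories by global image-text similarity before any pixel-level decision is made. The paper's exact formulation is not given here, so the sketch below is a hypothetical minimal version: cosine similarity between a pooled image embedding and each category's text embedding, turned into a softmax weight over categories (the temperature `tau` and the softmax normalization are assumptions, not confirmed details of LoGoSeg).

```python
import numpy as np

def object_existence_prior(img_emb, text_embs, tau=0.07):
    """Hypothetical sketch: weight categories by global image-text similarity.

    img_emb:   (D,) global image embedding (e.g., a CLIP-style pooled feature)
    text_embs: (C, D) one embedding per candidate category
    Returns a (C,) weight vector that down-weights categories unlikely to be
    present in the image, suppressing hallucinated masks for absent classes.
    """
    img = img_emb / np.linalg.norm(img_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarity per category
    logits = sims / tau
    logits -= logits.max()                # numerical stability before exp
    weights = np.exp(logits)
    return weights / weights.sum()        # softmax over categories

def apply_prior(pixel_logits, weights):
    """Scale per-pixel class scores (H, W, C) by the existence prior."""
    return pixel_logits * weights

# toy usage: 5 candidate categories, 16-dim embeddings, a 4x4 score map
rng = np.random.default_rng(0)
w = object_existence_prior(rng.normal(size=16), rng.normal(size=(5, 16)))
scored = apply_prior(rng.normal(size=(4, 4, 5)), w)
```

Because the prior is computed once per image from global features, it adds almost no cost relative to the dense per-pixel classification it modulates.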

Methodology

LoGoSeg adopts a single-stage framework that combines global image-text similarity, region-aware alignment, and a dual-stream fusion mechanism to improve open-vocabulary semantic segmentation.
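Of the three components above, the dual-stream fusion is the one that directly mixes the two feature streams. The paper's gating design is not specified in this summary, so the sketch below is an illustrative stand-in: it blends a spatially detailed local feature map with a single pooled global context vector using a per-pixel gate derived from their cosine agreement (the gate formula and the `alpha` parameter are assumptions for illustration only).

```python
import numpy as np

def dual_stream_fusion(local_feats, global_ctx, alpha=None):
    """Hypothetical sketch of fusing a local and a global feature stream.

    local_feats: (H, W, D) spatially detailed features (local stream)
    global_ctx:  (D,) pooled global semantic context (global stream)
    alpha:       optional per-pixel gate in [0, 1]; if None, a cosine-based
                 stand-in is used for whatever learned gate LoGoSeg employs.
    Returns an (H, W, D) fused feature map.
    """
    g = np.broadcast_to(global_ctx, local_feats.shape)
    if alpha is None:
        # gate by how strongly each pixel agrees with the global context
        num = (local_feats * g).sum(-1, keepdims=True)
        den = (np.linalg.norm(local_feats, axis=-1, keepdims=True)
               * np.linalg.norm(global_ctx) + 1e-8)
        alpha = 0.5 * (1.0 + num / den)   # map cosine from [-1, 1] to [0, 1]
    return alpha * local_feats + (1.0 - alpha) * g

# toy usage: a 3x3 map of 8-dim local features plus one global vector
rng = np.random.default_rng(1)
fused = dual_stream_fusion(rng.normal(size=(3, 3, 8)), rng.normal(size=8))
```

Setting `alpha=1.0` recovers the pure local stream and `alpha=0.0` the pure global one, which makes the convex blend easy to sanity-check.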

Original Abstract

Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories using arbitrary textual descriptions. While existing methods leverage vision-language models (VLMs) like CLIP, their reliance on image-level pretraining often results in imprecise spatial alignment, leading to mismatched segmentations in ambiguous or cluttered scenes. Moreover, most existing approaches lack strong object priors and region-level constraints, which can lead to object hallucination or missed detections, further degrading performance. To address these challenges, we propose LoGoSeg, an efficient single-stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context. Unlike prior works, LoGoSeg eliminates the need for external mask proposals, additional backbones, or extra datasets, ensuring efficiency. Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate its competitive performance and strong generalization in open-vocabulary settings.
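The abstract's second innovation, region-aware alignment, establishes region-level rather than pixel- or image-level visual-textual correspondence. The module's actual architecture is not described in this summary, so the sketch below is a hypothetical minimal version: average-pool dense features within each region of an integer region map, normalize, and match each region embedding to the closest category text embedding by cosine similarity (the pooling choice and nearest-neighbor assignment are assumptions, not LoGoSeg's confirmed design).

```python
import numpy as np

def region_text_alignment(feats, region_ids, text_embs):
    """Hypothetical sketch of region-level visual-textual correspondence.

    feats:      (H, W, D) dense visual features
    region_ids: (H, W) integer region map (e.g., from a grouping step)
    text_embs:  (C, D) category text embeddings
    Returns {region_id: category_index} assignments via cosine similarity.
    """
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    labels = {}
    for r in np.unique(region_ids):
        region = feats[region_ids == r].mean(axis=0)      # pool the region
        region = region / (np.linalg.norm(region) + 1e-8)  # unit-normalize
        labels[int(r)] = int(np.argmax(txt @ region))      # closest category
    return labels

# toy usage: a 4x4 feature map split into two horizontal regions
rng = np.random.default_rng(2)
feats = rng.normal(size=(4, 4, 6))
region_ids = np.zeros((4, 4), dtype=int)
region_ids[2:, :] = 1
labels = region_text_alignment(feats, region_ids, rng.normal(size=(3, 6)))
```

Constraining the match at the region level, as opposed to scoring every pixel independently, is what gives this kind of module its resistance to the missed detections the abstract mentions.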

Tags

Open-Vocabulary Semantic Segmentation · Vision-Language Models · Image Segmentation

arXiv Categories

cs.CV