Multimodal Learning (relevance: 9/10)

VIPA: Visual Informative Part Attention for Referring Image Segmentation

Yubin Cho, Hyunwoo Yu, Kyeongbo Kong, Kyomin Sohn, Bongjoon Hyun, Suk-Ju Kang
arXiv: 2602.14788v1 Published: 2026-02-16 Updated: 2026-02-16

AI Summary

Proposes the VIPA framework, which improves referring image segmentation accuracy through a Visual Informative Part Attention mechanism.

Key Contributions

  • Propose the VIPA framework, which applies Visual Informative Part Attention to referring image segmentation
  • Design a Visual Expression Generator (VEG) module that retrieves informative visual tokens
  • Outperform existing state-of-the-art methods on four public benchmarks

Methodology

The VEG module retrieves informative visual tokens via local-global linguistic context cues, then refines the retrieved tokens to reduce noise, enabling fine-grained alignment with the regions of interest.
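The retrieve-then-refine step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cue-fusion rule (averaging local and global cues), the top-k retrieval, and the single self-attention refinement pass are all assumptions made for the sketch; function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def veg_sketch(visual_tokens, local_cue, global_cue, k=4):
    """Hypothetical VEG-style step: retrieve the k visual tokens most
    similar to a fused local-global linguistic cue, then refine them
    with one self-attention pass to share informative attributes."""
    cue = 0.5 * (local_cue + global_cue)        # fuse cues (assumption)
    scores = visual_tokens @ cue                # (N,) cue-token similarity
    top_k = np.argsort(scores)[-k:]             # retrieval: top-k indices
    selected = visual_tokens[top_k]             # (k, d) informative tokens
    d = selected.shape[1]
    attn = softmax(selected @ selected.T / np.sqrt(d))  # (k, k) weights
    visual_expression = attn @ selected         # refined "visual expression"
    return visual_expression

rng = np.random.default_rng(0)
V = rng.normal(size=(16, 8))                    # 16 visual tokens, dim 8
ve = veg_sketch(V, rng.normal(size=8), rng.normal(size=8))
print(ve.shape)                                 # (4, 8)
```

The resulting `visual_expression` would then serve as an additional key/value context in the segmentation network's cross-modal attention, which is the role the paper assigns to the visual expression.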

Original Abstract

Referring Image Segmentation (RIS) aims to segment a target object described by a natural language expression. Existing methods have evolved by leveraging the vision information into the language tokens. To more effectively exploit visual contexts for fine-grained segmentation, we propose a novel Visual Informative Part Attention (VIPA) framework for referring image segmentation. VIPA leverages the informative parts of visual contexts, called a visual expression, which can effectively provide the structural and semantic visual target information to the network. This design reduces high-variance cross-modal projection and enhances semantic consistency in an attention mechanism of the referring image segmentation. We also design a visual expression generator (VEG) module, which retrieves informative visual tokens via local-global linguistic context cues and refines the retrieved tokens for reducing noise information and sharing informative visual attributes. This module allows the visual expression to consider comprehensive contexts and capture semantic visual contexts of informative regions. In this way, our framework enables the network's attention to robustly align with the fine-grained regions of interest. Extensive experiments and visual analysis demonstrate the effectiveness of our approach. Our VIPA outperforms the existing state-of-the-art methods on four public RIS benchmarks.

Tags

Referring Image Segmentation, Visual Informative Part Attention, Cross-modal Learning, Visual Expression Generation

arXiv Categories

cs.CV cs.AI