ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery
AI Summary
ProVG improves the accuracy of visual grounding in remote sensing imagery by decoupling language expressions and dynamically modulating visual attention.
Key Contributions
- Proposes ProVG, a new framework for remote sensing visual grounding (RSVG).
- Introduces a progressive cross-modal modulator for coarse-to-fine vision-language alignment.
- Designs a cross-scale fusion module and a language-guided calibration decoder to further improve performance.
Methodology
ProVG decouples each language expression into global context, spatial relations, and object attributes, then performs visual grounding by feeding these cues through a dynamic cross-modal modulator and multi-scale feature fusion.
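The summary does not include code, so the following is a minimal PyTorch sketch of how a survey-locate-verify style modulator could be organized: one cross-attention stage per decoupled cue, applied in sequence. All names and dimensions here (`ProgressiveModulator`, `dim=256`, three stages) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ProgressiveModulator(nn.Module):
    """Sketch of a survey-locate-verify cross-modal modulator.

    Each stage conditions the visual tokens on one decoupled linguistic
    cue (global context -> spatial relations -> object attributes) via
    cross-attention, progressively narrowing the grounding focus.
    Hypothetical design, not the paper's exact module.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # One cross-attention block per stage: survey, locate, verify.
        self.stages = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(3)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, visual, context, relations, attributes):
        # visual: (B, N, C) flattened image tokens.
        # context / relations / attributes: (B, L_i, C) text tokens,
        # one tensor per decoupled cue group.
        cues = (context, relations, attributes)
        for attn, norm, cue in zip(self.stages, self.norms, cues):
            # Visual tokens query the current cue; the residual update
            # modulates what the next, finer stage attends to.
            modulated, _ = attn(query=visual, key=cue, value=cue)
            visual = norm(visual + modulated)
        return visual
```

A quick shape check of the sketch:

```python
mod = ProgressiveModulator()
visual = torch.randn(2, 1024, 256)                  # flattened 32x32 map
ctx, rel, attr = (torch.randn(2, 8, 256) for _ in range(3))
out = mod(visual, ctx, rel, attr)                   # -> (2, 1024, 256)
```

Chaining one stage per cue means each later stage attends over features already conditioned on the coarser cues, which matches the coarse-to-fine behavior described above.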
Original Abstract
Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit fine-grained linguistic cues, such as spatial relations and object attributes, that are crucial for distinguishing objects with similar characteristics. Importantly, these cues play distinct roles across different grounding stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose ProVG, a novel RSVG framework that improves localization accuracy by decoupling language expressions into global context, spatial relations, and object attributes. To integrate these linguistic cues, ProVG employs a simple yet effective progressive cross-modal modulator, which dynamically modulates visual attention through a survey-locate-verify scheme, enabling coarse-to-fine vision-language alignment. In addition, ProVG incorporates a cross-scale fusion module to mitigate the large scale variations in remote sensing imagery, along with a language-guided calibration decoder to refine cross-modal alignment during prediction. A unified multi-task head further enables ProVG to support both referring expression comprehension and segmentation tasks. Extensive experiments on two benchmarks, i.e., RRSIS-D and RISBench, demonstrate that ProVG consistently outperforms existing methods, achieving new state-of-the-art performance.
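The abstract states only the goal of the cross-scale fusion module (mitigating scale variation), not its internals. A generic FPN-style upsample-and-add design is one plausible reading; the sketch below is that assumption, with a hypothetical `CrossScaleFusion` name, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleFusion(nn.Module):
    """Illustrative cross-scale fusion (assumed FPN-style design).

    Coarse features are upsampled and added into finer levels so that
    every level mixes information across scales; a 3x3 conv smooths
    each fused map.
    """

    def __init__(self, channels: int = 256, num_levels: int = 3):
        super().__init__()
        self.smooth = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for _ in range(num_levels)
        )

    def forward(self, feats):
        # feats: pyramid of (B, C, H_i, W_i) maps, ordered coarse -> fine.
        fused = feats[0]
        outputs = [self.smooth[0](fused)]
        for i, fine in enumerate(feats[1:], start=1):
            # Upsample the running fused map to the finer resolution
            # and merge it in before smoothing.
            fused = fine + F.interpolate(
                fused, size=fine.shape[-2:], mode="bilinear",
                align_corners=False,
            )
            outputs.append(self.smooth[i](fused))
        return outputs  # same resolutions, cross-scale information mixed
```

Top-down fusion of this kind is a common way to let small objects benefit from coarse context while preserving fine spatial detail, which is consistent with the scale-variation problem the abstract raises.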