MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing
AI Summary
Proposes MM-OVSeg, an open-vocabulary segmentation framework for remote sensing based on fusing optical and SAR imagery, addressing segmentation under adverse weather conditions.
Main Contributions
- Proposes a cross-modal unification process for multi-sensor representation alignment.
- Designs a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models.
- Achieves improved robustness and generalization under adverse weather conditions.
Methodology
Exploits the complementary strengths of optical and SAR imagery; through cross-modal unification and dual-encoder fusion, it achieves multimodal feature alignment and text-aligned segmentation.
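The text-aligned fusion idea can be sketched minimally: two modality-specific encoders produce features that are concatenated, projected into a shared embedding space, and scored against open-vocabulary text embeddings per pixel. All function names, weights, and dimensions below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def encode(feats, weights):
    # Hypothetical encoder: a single linear map standing in for a
    # vision foundation model backbone (illustrative only).
    return feats @ weights

def fuse(opt_feat, sar_feat, w_fuse):
    # Concatenate optical and SAR features, then project into the
    # shared text-aligned embedding space.
    return np.concatenate([opt_feat, sar_feat], axis=-1) @ w_fuse

def segment_scores(fused, text_embeds):
    # Cosine similarity between each pixel embedding and each
    # category-name embedding -> per-class score map.
    f = fused / np.linalg.norm(fused, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    return f @ t.T

rng = np.random.default_rng(0)
H, W, C, D, K = 4, 4, 8, 16, 3            # toy sizes: spatial, channels, embed dim, classes
optical = rng.normal(size=(H * W, C))     # flattened optical features
sar = rng.normal(size=(H * W, C))         # flattened SAR features
w_opt = rng.normal(size=(C, D))
w_sar = rng.normal(size=(C, D))
w_fuse = rng.normal(size=(2 * D, D))
texts = rng.normal(size=(K, D))           # embeddings of K open-set category names

scores = segment_scores(fuse(encode(optical, w_opt), encode(sar, w_sar), w_fuse), texts)
labels = scores.argmax(axis=-1).reshape(H, W)  # per-pixel open-vocabulary labels
```

In the actual framework, the encoders are hierarchical vision foundation models and the fusion involves multi-level features; the sketch only shows the alignment-and-score pattern.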
Original Abstract
Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities--optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available here.