Multimodal Learning Relevance: 9/10

MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing

Yimin Wei, Aoran Xiao, Hongruixuan Chen, Junshi Xia, Naoto Yokoya
arXiv: 2603.17528v1 Published: 2026-03-18 Updated: 2026-03-18

AI Summary

Introduces MM-OVSeg, an open-vocabulary segmentation framework for remote sensing based on fusing optical and SAR imagery, addressing segmentation under adverse weather conditions.

Main Contributions

  • Proposes a cross-modal unification process for aligning multi-sensor representations.
  • Designs a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models.
  • Achieves improved robustness and generalization under adverse weather conditions.

Methodology

Leverages the complementary strengths of optical and SAR imagery: cross-modal unification and dual-encoder fusion yield aligned multimodal features and text-aligned segmentation.

Original Abstract

Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities--optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available here.
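The abstract describes, but does not detail, the dual-encoder fusion module that combines hierarchical optical and SAR features and scores pixels against text embeddings. A minimal PyTorch sketch of this general idea is given below; all class names, the additive fusion rule, and the cosine-similarity scoring are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualEncoderFusion(nn.Module):
    """Illustrative sketch (not the paper's code): fuse per-level features
    from an optical and a SAR encoder, then score each pixel against
    text embeddings via cosine similarity."""

    def __init__(self, opt_dims, sar_dims, embed_dim=512):
        super().__init__()
        # One 1x1 projection per pyramid level, per modality.
        self.opt_proj = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in opt_dims)
        self.sar_proj = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in sar_dims)
        self.fuse = nn.Conv2d(embed_dim, embed_dim, 3, padding=1)

    def forward(self, opt_feats, sar_feats, text_emb):
        # opt_feats / sar_feats: lists of (B, C_i, H_i, W_i), coarse levels smaller
        # text_emb: (K, D) embeddings for K open-vocabulary class names
        target = opt_feats[0].shape[-2:]  # upsample everything to the finest grid
        fused = 0
        for po, fo, ps, fs in zip(self.opt_proj, opt_feats, self.sar_proj, sar_feats):
            o = F.interpolate(po(fo), size=target, mode="bilinear", align_corners=False)
            s = F.interpolate(ps(fs), size=target, mode="bilinear", align_corners=False)
            fused = fused + o + s  # simple additive cross-modal fusion (assumption)
        pixel = F.normalize(self.fuse(fused), dim=1)  # (B, D, H, W), unit-norm
        text = F.normalize(text_emb, dim=1)           # (K, D), unit-norm
        # Per-pixel, per-class cosine similarity logits: (B, K, H, W)
        return torch.einsum("bdhw,kd->bkhw", pixel, text)
```

Taking the argmax over the class dimension of the returned logits yields a segmentation map over whatever textual categories were embedded, which is what makes the head "open-vocabulary".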

Tags

Remote Sensing, Multimodal, Open-Vocabulary Segmentation, Optical Imagery, SAR

arXiv Categories

cs.CV