MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing
AI Summary
Proposes MM-OVSeg, an open-vocabulary segmentation framework for remote sensing based on fusing optical and SAR imagery, addressing segmentation under adverse weather conditions.
Main Contributions
- Proposes a cross-modal unification process for multi-sensor representation alignment.
- Designs a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models.
- Achieves improved robustness and generalization under adverse weather conditions.
Methodology
Exploits the complementary strengths of optical and SAR imagery; through cross-modal unification and dual-encoder fusion, it achieves multimodal feature alignment and text-aligned segmentation.
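The text-aligned fusion idea can be sketched minimally: two modality-specific encoders produce features that are concatenated, projected into a shared embedding space, and scored against open-vocabulary text embeddings per pixel. All function names, weights, and dimensions below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def encode(feats, weights):
    # Hypothetical encoder: a single linear map standing in for a
    # vision foundation model backbone (illustrative only).
    return feats @ weights

def fuse(opt_feat, sar_feat, w_fuse):
    # Concatenate optical and SAR features, then project into the
    # shared text-aligned embedding space.
    return np.concatenate([opt_feat, sar_feat], axis=-1) @ w_fuse

def segment_scores(fused, text_embeds):
    # Cosine similarity between each pixel embedding and each
    # category-name embedding -> per-class score map.
    f = fused / np.linalg.norm(fused, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    return f @ t.T

rng = np.random.default_rng(0)
H, W, C, D, K = 4, 4, 8, 16, 3            # toy sizes: spatial, channels, embed dim, classes
optical = rng.normal(size=(H * W, C))     # flattened optical features
sar = rng.normal(size=(H * W, C))         # flattened SAR features
w_opt = rng.normal(size=(C, D))
w_sar = rng.normal(size=(C, D))
w_fuse = rng.normal(size=(2 * D, D))
texts = rng.normal(size=(K, D))           # embeddings of K open-set category names

scores = segment_scores(fuse(encode(optical, w_opt), encode(sar, w_sar), w_fuse), texts)
labels = scores.argmax(axis=-1).reshape(H, W)  # per-pixel open-vocabulary labels
```

In the actual framework, the encoders are hierarchical vision foundation models and the fusion involves multi-level features; the sketch only shows the alignment-and-score pattern.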
Original Abstract
Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities--optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available here.