Multimodal Learning Relevance: 8/10

PhysFusion: A Transformer-based Dual-Stream Radar and Vision Fusion Framework for Open Water Surface Object Detection

Yuting Wan, Liguo Sun, Jiuwu Hao, Zao Zhang, Pin LV
arXiv: 2603.01947v1 Published: 2026-03-02 Updated: 2026-03-02

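AI Summary

Proposes PhysFusion, which fuses radar and vision information to improve the accuracy and robustness of water-surface object detection.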

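Main Contributions

  • Proposes a Physics-Informed Radar Encoder (PIR Encoder)
  • Designs a Radar-guided Interactive Fusion Module (RIFM)
  • Introduces a Temporal Query Aggregation module (TQA)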

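Methodology

A Transformer-based dual-stream architecture that fuses radar scattering priors with multi-scale visual features and uses temporal information to strengthen target representations.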
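
The temporal step can be illustrated with a small sketch. The paper digest does not specify how TQA combines frames, so the attention pooling below is an assumption chosen for illustration, not the authors' method; the names `aggregate_queries`, `window`, and `current` are hypothetical. It weights each frame's query vector by its scaled dot-product similarity to the latest frame's query and returns the weighted average.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_queries(window, current):
    """Attention-pool a short window of per-frame query vectors.

    window:  list of T query vectors (oldest first), each a list of floats
    current: query vector of the latest frame, used to score each frame
    Returns (pooled_vector, weights). This pooling rule is an assumption;
    the actual TQA design may differ.
    """
    dim = len(current)
    scale = math.sqrt(dim)
    # Scaled dot-product similarity between each past query and the current one.
    scores = [sum(q * c for q, c in zip(frame, current)) / scale
              for frame in window]
    weights = softmax(scores)
    # Weighted average of the window, one dimension at a time.
    pooled = [sum(w * frame[i] for w, frame in zip(weights, window))
              for i in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    pooled, weights = aggregate_queries(frames, current=[1.0, 0.0])
    print(pooled, weights)
```

Frames whose queries resemble the current one receive larger weights, so an intermittent radar return in a single frame is smoothed rather than discarded.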

Original Abstract

Detecting water-surface targets for Unmanned Surface Vehicles (USVs) is challenging due to wave clutter, specular reflections, and weak appearance cues in long-range observations. Although 4D millimeter-wave radar complements cameras under degraded illumination, maritime radar point clouds are sparse and intermittent, with reflectivity attributes exhibiting heavy-tailed variations under scattering and multipath, making conventional fusion designs struggle to exploit radar cues effectively. We propose PhysFusion, a physics-informed radar-image detection framework for water-surface perception. The framework integrates: (1) a Physics-Informed Radar Encoder (PIR Encoder) with an RCS Mapper and Quality Gate, transforming per-point radar attributes into compact scattering priors and predicting point-wise reliability for robust feature learning under clutter; (2) a Radar-guided Interactive Fusion Module (RIFM) performing query-level radar-image fusion between semantically enriched radar features and multi-scale visual features, with the radar branch modeled by a dual-stream backbone including a point-based local stream and a transformer-based global stream using Scattering-Aware Self-Attention (SASA); and (3) a Temporal Query Aggregation module (TQA) aggregating frame-wise fused queries over a short temporal window for temporally consistent representations. Experiments on WaterScenes and FLOW demonstrate that PhysFusion achieves 59.7% mAP50:95 and 90.3% mAP50 on WaterScenes (T=5 radar history) using 5.6M parameters and 12.5G FLOPs, and reaches 94.8% mAP50 and 46.2% mAP50:95 on FLOW under radar+camera setting. Ablation studies quantify the contributions of PIR Encoder, SASA-based global reasoning, and RIFM.
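
To make the idea of compact scattering priors and point-wise reliability concrete, here is a minimal sketch. The abstract does not give the internals of the RCS Mapper or Quality Gate, so everything below is an assumption: the signed log compression stands in for any transform that tames heavy-tailed reflectivity, the logistic gate uses fixed hand-picked weights where a real model would learn them, and the use of Doppler as a gate input is hypothetical.

```python
import math


def compress_rcs(rcs):
    """Signed log compression: keeps the sign, squashes heavy tails."""
    return math.copysign(math.log1p(abs(rcs)), rcs)


def sigmoid(z):
    """Logistic function written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def quality_gate(compressed_rcs, doppler, w_rcs=1.0, w_doppler=-0.5, bias=0.0):
    """Hypothetical reliability score in (0, 1).

    The weights are illustrative constants, not values from the paper.
    """
    return sigmoid(w_rcs * compressed_rcs + w_doppler * abs(doppler) + bias)


def encode_points(points):
    """Map (rcs, doppler) tuples to (gated_feature, reliability) tuples."""
    encoded = []
    for rcs, doppler in points:
        feat = compress_rcs(rcs)
        gate = quality_gate(feat, doppler)
        # Down-weight the feature of points the gate considers unreliable.
        encoded.append((gate * feat, gate))
    return encoded


if __name__ == "__main__":
    for feature, reliability in encode_points([(1000.0, 0.0), (0.5, 4.0)]):
        print(round(feature, 3), round(reliability, 3))
```

The compression keeps an extreme reflectivity value from dominating the feature scale, while the gate lets downstream fusion lean on points judged reliable and suppress likely clutter.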

Tags

Water-surface object detection, Radar-vision fusion, Transformer, USV

arXiv Categories

cs.CV cs.AI