Multimodal Learning Relevance: 9/10

SEAR: Simple and Efficient Adaptation of Visual Geometric Transformers for RGB+Thermal 3D Reconstruction

Vsevolod Skorokhodov, Chenghao Xu, Shuo Sun, Olga Fink, Malcolm Mielle
arXiv: 2603.18774v1 Published: 2026-03-19 Updated: 2026-03-19

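AI Summary

Proposes SEAR, an efficient fine-tuning method that adapts a visual geometry Transformer to RGB+thermal 3D reconstruction and improves alignment between the two modalities.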

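Key Contributions

  • Proposes the SEAR fine-tuning strategy, which improves 3D reconstruction from RGB-T images
  • Introduces a new RGB+Thermal dataset for multimodal 3D reconstruction
  • Shows experimentally that SEAR remains robust under conditions such as low lighting and dense smoke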

Methodology

Fine-tunes a pretrained visual geometry Transformer so that it handles RGB-T inputs, training on a relatively small dataset.

Original Abstract

Foundational feed-forward visual geometry models enable accurate and efficient camera pose estimation and scene reconstruction by learning strong scene priors from massive RGB datasets. However, their effectiveness drops when applied to mixed sensing modalities, such as RGB-thermal (RGB-T) images. We observe that while a visual geometry grounded transformer pretrained on RGB data generalizes well to thermal-only reconstruction, it struggles to align RGB and thermal modalities when processed jointly. To address this, we propose SEAR, a simple yet efficient fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T inputs. Despite being trained on a relatively small RGB-T dataset, our approach significantly outperforms state-of-the-art methods for 3D reconstruction and camera pose estimation, achieving significant improvements over all metrics (e.g., over 29% in AUC@30) and delivering higher detail and consistency between modalities with negligible overhead in inference time compared to the original pretrained model. Notably, SEAR enables reliable multimodal pose estimation and reconstruction even under challenging conditions, such as low lighting and dense smoke. We validate our architecture through extensive ablation studies, demonstrating how the model aligns both modalities. Additionally, we introduce a new dataset featuring RGB and thermal sequences captured at different times, viewpoints, and illumination conditions, providing a robust benchmark for future work in multimodal 3D scene reconstruction. Code and models are publicly available at https://www.github.com/Schindler-EPFL-Lab/SEAR.
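
The abstract reports gains in AUC@30 without defining the metric. The sketch below shows one common way such a pose AUC is computed in the camera pose estimation literature: accuracy is measured at every integer angular threshold up to 30 degrees and then averaged. The function name `pose_auc`, the 1-degree step, and the assumption that each input value is a per-image-pair angular error in degrees (often the larger of the rotation and translation errors) are illustrative choices, not details taken from the paper.

```python
def pose_auc(errors_deg, threshold_deg=30):
    """Approximate area under the accuracy-vs-threshold curve, in [0, 1].

    errors_deg: per-pair angular pose errors in degrees (assumed input format).
    threshold_deg: largest threshold considered, e.g. 30 for AUC@30.
    """
    if not errors_deg:
        raise ValueError("need at least one error value")
    n = len(errors_deg)
    # Accuracy at each integer threshold t = 1..threshold_deg:
    # the fraction of pairs whose error is strictly below t degrees.
    accuracies = [
        sum(1 for e in errors_deg if e < t) / n
        for t in range(1, threshold_deg + 1)
    ]
    # Averaging the accuracies normalizes the area by the threshold.
    return sum(accuracies) / threshold_deg


if __name__ == "__main__":
    # Hypothetical errors for four image pairs, for illustration only.
    print(pose_auc([2.0, 5.5, 12.0, 48.0]))
```

Under this definition a model with zero error on every pair scores 1.0 and a model whose errors all exceed 30 degrees scores 0.0, so small errors are rewarded more than errors that only just pass the threshold.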

Tags

RGB-T, 3D Reconstruction, Transformer, Multimodal Learning, Fine-tuning

arXiv Categories

cs.CV