Multimodal Learning Relevance: 9/10

RS-WorldModel: a Unified Model for Remote Sensing Understanding and Future Scene Forecasting

Linrui Xu, Zhongan Wang, Fei Shen, Gang Xu, Huiping Zhuang, Ming Li, Haifeng Li
arXiv: 2603.14941v1 Published: 2026-03-16 Updated: 2026-03-16

AI Summary

RS-WorldModel unifies remote sensing understanding with future scene forecasting, introduces the new RSWBench-1.1M dataset, and surpasses existing models.

Key Contributions

  • Proposes RS-WorldModel, a unified world model for remote sensing
  • Builds RSWBench-1.1M, a large-scale remote sensing dataset
  • Surpasses existing models on remote sensing understanding and future forecasting tasks

Methodology

Adopts three-stage training — geo-aware generative pre-training, synergistic instruction tuning, and verifiable reinforcement optimization — to improve model performance.

Original Abstract

Remote sensing world models aim to both explain observed changes and forecast plausible futures, two tasks that share spatiotemporal priors. Existing methods, however, typically address them separately, limiting cross-task transfer. We present RS-WorldModel, a unified world model for remote sensing that jointly handles spatiotemporal change understanding and text-guided future scene forecasting, and we build RSWBench-1.1M, a 1.1 million sample dataset with rich language annotations covering both tasks. RS-WorldModel is trained in three stages: (1) Geo-Aware Generative Pre-training (GAGP) conditions forecasting on geographic and acquisition metadata; (2) synergistic instruction tuning (SIT) jointly trains understanding and forecasting; (3) verifiable reinforcement optimization (VRO) refines outputs with verifiable, task-specific rewards. With only 2B parameters, RS-WorldModel surpasses open-source models up to 120$\times$ larger on most spatiotemporal change question-answering metrics. It achieves an FID of 43.13 on text-guided future scene forecasting, outperforming all open-source baselines as well as the closed-source Gemini-2.5-Flash Image (Nano Banana).
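The FID score quoted above is the standard Fréchet Inception Distance: the Fréchet distance between Gaussians fitted to feature embeddings of real and generated images. The paper summary does not specify the feature extractor used for remote sensing imagery, so the sketch below is a generic illustration of the metric itself, with random vectors standing in for the image features:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """FID core formula: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2*(S1 S2)^(1/2))."""
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

def feature_stats(feats):
    """Mean and covariance of a (num_samples, feature_dim) embedding matrix."""
    return feats.mean(axis=0), np.cov(feats, rowvar=False)

# Toy stand-ins for real vs. generated image embeddings (not actual model features).
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 16))
fake_feats = rng.normal(loc=0.5, size=(500, 16))

fid = frechet_distance(*feature_stats(real_feats), *feature_stats(fake_feats))
```

Lower is better: identical feature distributions give an FID near zero, and the reported 43.13 is measured against held-out real future scenes.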

Tags

Remote Sensing · World Models · Multimodal · Spatiotemporal Forecasting

arXiv Categories

cs.AI