Multimodal Learning Relevance: 9/10

RS-WorldModel: a Unified Model for Remote Sensing Understanding and Future Scene Forecasting

Linrui Xu, Zhongan Wang, Fei Shen, Gang Xu, Huiping Zhuang, Ming Li, Haifeng Li
arXiv: 2603.14941v1 Published: 2026-03-16 Updated: 2026-03-16

AI Summary

RS-WorldModel unifies remote sensing understanding with future scene forecasting, introduces the new RSWBench-1.1M dataset, and surpasses existing models.

Key Contributions

  • Proposes RS-WorldModel, a unified world model for remote sensing
  • Builds RSWBench-1.1M, a large-scale remote sensing dataset
  • Surpasses existing models on remote sensing understanding and future forecasting tasks

Methodology

Adopts three-stage training — geo-aware generative pre-training, synergistic instruction tuning, and verifiable reinforcement optimization — to improve model performance.

Original Abstract

Remote sensing world models aim to both explain observed changes and forecast plausible futures, two tasks that share spatiotemporal priors. Existing methods, however, typically address them separately, limiting cross-task transfer. We present RS-WorldModel, a unified world model for remote sensing that jointly handles spatiotemporal change understanding and text-guided future scene forecasting, and we build RSWBench-1.1M, a 1.1 million sample dataset with rich language annotations covering both tasks. RS-WorldModel is trained in three stages: (1) Geo-Aware Generative Pre-training (GAGP) conditions forecasting on geographic and acquisition metadata; (2) synergistic instruction tuning (SIT) jointly trains understanding and forecasting; (3) verifiable reinforcement optimization (VRO) refines outputs with verifiable, task-specific rewards. With only 2B parameters, RS-WorldModel surpasses open-source models up to 120$\times$ larger on most spatiotemporal change question-answering metrics. It achieves an FID of 43.13 on text-guided future scene forecasting, outperforming all open-source baselines as well as the closed-source Gemini-2.5-Flash Image (Nano Banana).
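The FID score quoted above is the standard Fréchet Inception Distance: the Fréchet distance between Gaussians fitted to feature embeddings of real and generated images. The paper summary does not specify the feature extractor used for remote sensing imagery, so the sketch below is a generic illustration of the metric itself, with random vectors standing in for the image features:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """FID core formula: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2*(S1 S2)^(1/2))."""
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

def feature_stats(feats):
    """Mean and covariance of a (num_samples, feature_dim) embedding matrix."""
    return feats.mean(axis=0), np.cov(feats, rowvar=False)

# Toy stand-ins for real vs. generated image embeddings (not actual model features).
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 16))
fake_feats = rng.normal(loc=0.5, size=(500, 16))

fid = frechet_distance(*feature_stats(real_feats), *feature_stats(fake_feats))
```

Lower is better: identical feature distributions give an FID near zero, and the reported 43.13 is measured against held-out real future scenes.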

Tags

Remote Sensing · World Models · Multimodal · Spatiotemporal Forecasting

arXiv Categories

cs.AI