Multimodal Learning (Relevance: 9/10)

TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota
arXiv: 2603.19039v1 · Published: 2026-03-19 · Updated: 2026-03-19

AI Summary

TerraScope is a VLM for pixel-grounded visual reasoning in earth observation tasks.

Main Contributions

  • Proposes the TerraScope model, which supports pixel-grounded geospatial reasoning
  • Curates the Terra-CoT dataset, containing million-scale samples with pixel-level annotations
  • Builds the TerraScope-Bench benchmark to evaluate pixel-grounded reasoning ability

Methodology

TerraScope uses a unified VLM framework that fuses multimodal and multi-temporal data to achieve precise pixel-level reasoning, and is trained on chain-of-thought (CoT) data.
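The modality-flexible behavior described here (single-modality pass-through, adaptive fusion when both optical and SAR inputs are available) can be illustrated with a toy sketch. This is a hypothetical stand-in, not the paper's actual fusion mechanism; the 50/50 blend is a placeholder for whatever learned gate the model uses.

```python
import numpy as np

def fuse_modalities(optical=None, sar=None):
    """Toy stand-in for modality-flexible fusion: not TerraScope's
    architecture, only an illustration of the single-vs-dual-input
    behavior described in the abstract."""
    if optical is not None and sar is not None:
        # Placeholder for a learned gate: a fixed 50/50 feature blend.
        return 0.5 * optical + 0.5 * sar
    if optical is not None:
        return optical  # optical-only input passes through
    if sar is not None:
        return sar      # SAR-only input passes through
    raise ValueError("at least one modality is required")
```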

Original Abstract

Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning, with six sub-tasks that evaluate both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
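The benchmark's two-sided scoring (answer accuracy plus mask quality) can be sketched as below. This is an illustrative scheme using binary-mask IoU as the mask-quality proxy; the metric names and the `evaluate` helper are assumptions, not TerraScope-Bench's published metric definitions.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union

def evaluate(samples):
    """samples: iterable of (pred_answer, gt_answer, pred_mask, gt_mask).
    Returns (answer accuracy, mean mask IoU) -- a hypothetical scoring
    scheme in the spirit of the benchmark's two-sided evaluation."""
    acc = float(np.mean([p == g for p, g, _, _ in samples]))
    miou = float(np.mean([mask_iou(pm, gm) for _, _, pm, gm in samples]))
    return acc, miou
```

Scoring both sides jointly matters: a model can emit the right answer while pointing at the wrong pixels, so answer accuracy alone would not certify genuinely pixel-grounded reasoning.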

Tags

Earth Observation · Vision-Language Models · Pixel-Level Reasoning · Multimodal Learning · Multi-Temporal Analysis

arXiv Category

cs.CV