Multimodal Learning Relevance: 9/10

BigEarthNet.txt: A Large-Scale Multi-Sensor Image-Text Dataset and Benchmark for Earth Observation

Johann-Ludwig Herzog, Mathis Jürgen Adler, Leonard Hackel, Yan Shu, Angelos Zavras, Ioannis Papoutsis, Paolo Rota, Begüm Demir
arXiv: 2603.29630v1 Published: 2026-03-31 Updated: 2026-03-31

AI Summary

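Introduces BigEarthNet.txt, a large-scale, multi-sensor remote sensing image-text dataset intended to improve the performance of vision-language models in the remote sensing domain.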

Main Contributions

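  • Builds BigEarthNet.txt, a large-scale, multi-sensor remote sensing image-text dataset
  • The dataset contains several types of text annotation: geographically anchored captions, visual question answering pairs, and referring expression detection instructions
  • Experiments show the limitations of existing VLMs on remote sensing tasks and confirm the performance gains from fine-tuning on BigEarthNet.txt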

Methodology

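Build the dataset, compare it against existing datasets, establish a benchmark split, evaluate existing VLMs, and fine-tune on BigEarthNet.txt to verify its effect.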

Original Abstract

Vision-language models (VLMs) have shown strong performance in computer vision (CV), yet their performance on remote sensing (RS) data remains limited due to the lack of large-scale, multi-sensor RS image-text datasets with diverse textual annotations. Existing datasets predominantly include aerial Red-Green-Blue imagery, with short or weakly grounded captions, and provide limited diversity in annotation types. To address this limitation, we introduce BigEarthNet.txt, a large-scale, multi-sensor image-text dataset designed to advance instruction-driven image-text learning in Earth observation across multiple tasks. BigEarthNet.txt contains 464044 co-registered Sentinel-1 synthetic aperture radar and Sentinel-2 multispectral images with 9.6M text annotations, including: i) geographically anchored captions describing land-use/land-cover (LULC) classes, their spatial relations, and environmental context; ii) visual question answering pairs relevant for different tasks; and iii) referring expression detection instructions for bounding box prediction. Through a comparative statistical analysis, we demonstrate that BigEarthNet.txt surpasses existing RS image-text datasets in textual richness and annotation type variety. We further establish a manually-verified benchmark split to evaluate VLMs in RS and CV. The results show the limitations of these models on tasks that involve complex LULC classes, whereas fine-tuning using BigEarthNet.txt results in consistent performance gains across all considered tasks.

Tags

Remote Sensing, Image-Text, Multimodal Learning, Dataset, Vision-Language Models

arXiv Categories

cs.CV