Grounding Synthetic Data Generation With Vision and Language Models
AI Summary
Proposes a vision-language-model-grounded framework for synthetic data generation and evaluation, applied to remote sensing image augmentation, and builds the ARAS400k dataset.
Main Contributions
- Proposes a vision-language-model-grounded framework for synthetic data generation and evaluation
- Builds ARAS400k, a large-scale augmented remote sensing dataset
- Validates the effectiveness of synthetic data for semantic segmentation and image captioning
Methodology
Combines generative models, semantic segmentation, and image captioning with vision and language models; synthetic data is evaluated by analyzing semantic composition, minimizing caption redundancy, and verifying cross-modal consistency.
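The last two evaluation criteria can be sketched in miniature as follows. This is an illustrative reading, not the paper's implementation: the function names, the Jaccard-overlap redundancy measure, and the 0.3 similarity threshold are all assumptions, and the embeddings are presumed to come from some joint vision-language encoder (e.g. a CLIP-style model).

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def caption_redundancy(captions):
    """Mean pairwise token-set Jaccard overlap across captions.

    Lower values mean more diverse descriptions; 1.0 means all
    captions share exactly the same vocabulary.
    """
    sets = [set(c.lower().split()) for c in captions]
    pairs = [(i, j) for i in range(len(sets))
             for j in range(i + 1, len(sets))]
    if not pairs:
        return 0.0
    return sum(len(sets[i] & sets[j]) / len(sets[i] | sets[j])
               for i, j in pairs) / len(pairs)

def cross_modal_consistency(image_emb, caption_emb, threshold=0.3):
    """Accept a synthetic image/caption pair only if their embeddings
    agree above a threshold (0.3 here is an arbitrary placeholder)."""
    return cosine(image_emb, caption_emb) >= threshold
```

In a pipeline like this, pairs failing the consistency check would be filtered out before the synthetic images are mixed into the training set, and high-redundancy caption sets would be flagged for regeneration.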
Original Abstract
Deep learning models benefit from increasing data diversity and volume, motivating synthetic data augmentation to improve existing datasets. However, existing evaluation metrics for synthetic data typically calculate latent feature similarity, which is difficult to interpret and does not always correlate with the contribution to downstream tasks. We propose a vision-language grounded framework for interpretable synthetic data augmentation and evaluation in remote sensing. Our approach combines generative models, semantic segmentation and image captioning with vision and language models. Based on this framework, we introduce ARAS400k: A large-scale Remote sensing dataset Augmented with Synthetic data for segmentation and captioning, containing 100k real images and 300k synthetic images, each paired with segmentation maps and descriptions. ARAS400k enables the automated evaluation of synthetic data by analyzing semantic composition, minimizing caption redundancy, and verifying cross-modal consistency between visual structures and language descriptions. Experimental results indicate that while models trained exclusively on synthetic data reach competitive performance levels, those trained with augmented data (a combination of real and synthetic images) consistently outperform real-data baselines. Consequently, this work establishes a scalable benchmark for remote sensing tasks, specifically in semantic segmentation and image captioning. The dataset is available at zenodo.org/records/18890661 and the code base at github.com/caglarmert/ARAS400k.
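The "semantic composition" analysis mentioned in the abstract presumably reduces each segmentation map to a per-class pixel distribution, which can then be compared between real and synthetic subsets. A minimal sketch under that assumption (the function name and the plain nested-list representation of a segmentation map are illustrative, not from the paper):

```python
from collections import Counter

def semantic_composition(seg_map):
    """Per-class pixel fraction of a segmentation map.

    `seg_map` is a 2-D grid (list of rows) of integer class ids;
    returns a dict mapping class id -> fraction of pixels.
    """
    counts = Counter(cls for row in seg_map for cls in row)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}
```

Comparing these distributions (e.g. real vs. synthetic class frequencies) gives an interpretable check that the generator does not over- or under-represent any land-cover class.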