AI Agents Relevance: 8/10

Reasoning-Driven Synthetic Data Generation and Evaluation

Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous
arXiv: 2603.29791v1 Published: 2026-03-31 Updated: 2026-03-31

AI Summary

Proposes Simula, a framework that generates and evaluates synthetic data through a reasoning-driven process to address data scarcity.

Key Contributions

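  • Introduces the Simula framework, a reasoning-driven method for synthetic data generation
  • Offers guidelines for synthetic data mechanism design
  • Explores methods for generating and evaluating synthetic data at scale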

Methodology

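Generates synthetic data with a seedless, agentic approach in which users define the desired dataset characteristics in a controllable way; the generated data is then evaluated.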

Original Abstract

Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution - limiting their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process that enables fine-grained resource allocation. We show the efficacy of our approach on a variety of datasets, rigorously testing both intrinsic and downstream properties. Our work (1) offers guidelines for synthetic data mechanism design, (2) provides insights into generating and evaluating synthetic data at scale, and (3) unlocks new opportunities for developing and deploying AI in domains where data scarcity or privacy concerns are paramount.

Tags

Synthetic Data, Reasoning, Data Generation, Data Evaluation

arXiv Categories

cs.AI cs.CL cs.LG