AI Agents relevance: 5/10

Synthesizing Realistic Test Data without Breaking Privacy

Laura Plein, Alexi Turcotte, Arina Hallemans, Andreas Zeller
arXiv: 2602.05833v1 · Published: 2026-02-05 · Updated: 2026-02-05

AI Summary

Proposes a privacy-preserving synthetic data generation method based on a fuzzer and a discriminator, improving both data utility and privacy.

Key Contributions

  • Proposes generating synthetic data with a fuzzer and a discriminator
  • Leverages the original data only indirectly during generation, preserving privacy
  • Experiments show the method preserves privacy while maintaining high data utility

Methodology

A fuzzer generates candidate data, a discriminator scores how close each candidate is to the original data, and samples are evolved toward high discriminator scores, yielding privacy-preserving data.
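The evolve-and-discriminate loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: all names (`fuzz_sample`, `discriminator_score`, `mutate`) are hypothetical, the "input specification" is reduced to a numeric range, and the discriminator (a trained model in the paper) is replaced by a stand-in that scores how close a candidate sample's mean is to a target statistic of the original data.

```python
import random

TARGET_MEAN = 5.0  # stand-in for a statistic of the original dataset

def fuzz_sample(rng, size=10):
    """Generate a candidate sample from a toy 'specification' (a range)."""
    return [rng.uniform(0.0, 10.0) for _ in range(size)]

def mutate(sample, rng, strength=0.5):
    """Perturb one candidate to produce an evolved variant."""
    return [x + rng.gauss(0.0, strength) for x in sample]

def discriminator_score(sample):
    """Stand-in discriminator: higher means closer to the target statistic."""
    mean = sum(sample) / len(sample)
    return -abs(mean - TARGET_MEAN)

def evolve(generations=50, population=20, seed=0):
    """Keep the 'good samples' per the discriminator, mutate them, repeat."""
    rng = random.Random(seed)
    pool = [fuzz_sample(rng) for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=discriminator_score, reverse=True)
        survivors = pool[: population // 2]           # keep good samples
        children = [mutate(s, rng) for s in survivors]
        pool = survivors + children
    return max(pool, key=discriminator_score)

best = evolve()
print(round(sum(best) / len(best), 2))
```

Note that the original data never enters the generation step directly; only the discriminator's score (here, a single target statistic) guides the evolution, which is the indirect-leverage idea the summary describes.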

Original Abstract

There is a need for synthetic training and test datasets that replicate statistical distributions of original datasets without compromising their confidentiality. A lot of research has been done in leveraging Generative Adversarial Networks (GANs) for synthetic data generation. However, the resulting models are either not accurate enough or are still vulnerable to membership inference attacks (MIA) or dataset reconstruction attacks, since the original data has been leveraged in the training process. In this paper, we explore the feasibility of producing a synthetic test dataset with the same statistical properties as the original one, while only indirectly leveraging the original data in the generation process. The approach is inspired by GANs, with a generation step and a discrimination step. However, in our approach, we use a test generator (a fuzzer) to produce test data from an input specification, preserving constraints set by the original data; a discriminator model determines how close we are to the original data. By evolving samples and determining "good samples" with the discriminator, we can generate privacy-preserving data that follows the same statistical distributions as the original dataset, leading to a utility similar to that of the original data. We evaluated our approach on four datasets that have been used to evaluate the state-of-the-art techniques. Our experiments highlight the potential of our approach towards generating synthetic datasets that have high utility while preserving privacy.

Tags

synthetic data · privacy preservation · generative adversarial networks · data generation

arXiv Categories

cs.LG