LLM Reasoning relevance: 5/10

SNPgen: Phenotype-Supervised Genotype Representation and Synthetic Data Generation via Latent Diffusion

Andrea Lampis, Michela Carlotta Massi, Nicola Pirastu, Francesca Ieva, Matteo Matteucci, Emanuele Di Angelantonio
arXiv: 2603.10873v1 Published: 2026-03-11 Updated: 2026-03-11

AI Summary

SNPgen is a framework for phenotype-supervised synthetic genotype generation based on conditional latent diffusion.

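Key Contributions

  • Proposes SNPgen, a two-stage conditional latent diffusion framework
  • Enables phenotype-supervised generation of synthetic genotypes
  • Preserves the statistical properties of the genetic data and its downstream task utility while protecting privacy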

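Methodology

Combines GWAS-guided variant selection (1,024-2,048 trait-associated SNPs), variational autoencoder compression of the genotypes, and a latent diffusion model conditioned on binary disease labels via classifier-free guidance.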
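As a rough illustration of the classifier-free guidance step mentioned above, the sketch below blends an unconditional and a label-conditioned noise prediction at sampling time. It is a minimal sketch under assumptions, not the paper's implementation: the function name `cfg_combine`, the `guidance_scale` parameter, and the blending convention `uncond + scale * (cond - uncond)` are illustrative choices (other write-ups use an equivalent form with a shifted scale), and the digest does not state which scale SNPgen uses.

```python
from typing import List, Sequence


def cfg_combine(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Blend two noise predictions with classifier-free guidance.

    eps_uncond: prediction made with the disease label dropped (hypothetical input).
    eps_cond:   prediction made with the disease label supplied (hypothetical input).
    guidance_scale: 0 ignores the label, 1 reproduces the conditional
        prediction, values above 1 push further toward the label.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    # Move from the unconditional prediction toward the conditional one.
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    uncond = [0.0, 1.0]
    cond = [1.0, 3.0]
    for scale in (0.0, 1.0, 2.0):
        print(scale, cfg_combine(uncond, cond, scale))
```

In a full sampler this blend would be applied to the denoiser's output at every diffusion step in the VAE latent space; here plain lists stand in for latent vectors.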

Original Abstract

Polygenic risk scores and other genomic analyses require large individual-level genotype datasets, yet strict data access restrictions impede sharing. Synthetic genotype generation offers a privacy-preserving alternative, but most existing methods operate unconditionally, producing samples without phenotype alignment, or rely on unsupervised compression, creating a gap between statistical fidelity and downstream task utility. We present SNPgen, a two-stage conditional latent diffusion framework for generating phenotype-supervised synthetic genotypes. SNPgen combines GWAS-guided variant selection (1,024-2,048 trait-associated SNPs) with a variational autoencoder for genotype compression and a latent diffusion model conditioned on binary disease labels via classifier-free guidance. Evaluated on 458,724 UK Biobank individuals across four complex diseases (coronary artery disease, breast cancer, type 1 and type 2 diabetes), models trained on synthetic data matched real-data predictive performance in a train-on-synthetic, test-on-real protocol, approaching genome-wide PRS methods that use $2$-$6\times$ more variants. Privacy analysis confirmed zero identical matches, near-random membership inference (AUC $\approx 0.50$), preserved linkage disequilibrium structure, and high allele frequency correlation ($r \geq 0.95$) with source data. A controlled simulation with known causal effects verified faithful recovery of the imposed genetic association structure.
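The allele frequency correlation reported in the abstract can be illustrated with a small sketch. It assumes genotypes are coded as 0/1/2 counts of one allele per SNP (rows are individuals, columns are SNPs) and uses the Pearson correlation between per-SNP frequencies in the real and synthetic sets. The helper names `allele_frequencies` and `pearson_r` and the toy matrices are hypothetical, and the abstract does not spell out the exact metric definition the authors used.

```python
import math
from typing import List, Sequence


def allele_frequencies(genotypes: Sequence[Sequence[int]]) -> List[float]:
    """Per-SNP allele frequency for a 0/1/2-coded genotype matrix."""
    n_individuals = len(genotypes)
    n_snps = len(genotypes[0])
    # Each diploid individual carries two alleles, hence the factor of 2.
    return [
        sum(row[j] for row in genotypes) / (2 * n_individuals)
        for j in range(n_snps)
    ]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy data for illustration only, not drawn from the paper.
    real = [[0, 1, 2], [0, 1, 2], [0, 2, 2]]
    synthetic = [[0, 1, 2], [1, 1, 2], [0, 2, 1]]
    r = pearson_r(allele_frequencies(real), allele_frequencies(synthetic))
    print(f"allele frequency correlation: {r:.3f}")
```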

Tags

synthetic data generation, genotype, phenotype, diffusion model, privacy preservation

arXiv Categories

cs.LG q-bio.GN