LLM Memory & RAG 相关度: 8/10

Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation

Adam Jakobsen, Sushant Gautam, Hugo Lewi Hammer, Susanne Olofsdotter, Miriam S Johanson, Pål Halvorsen, Vajira Thambawita
arXiv: 2603.25186v1 发布: 2026-03-26 更新: 2026-03-26

AI 摘要

提出一种知识引导的检索增强生成框架,用于生成保护隐私的精神科合成数据。

主要贡献

  • 提出基于LLM的零样本精神科合成数据生成方法
  • 利用DSM-5和ICD-10知识库引导LLM生成
  • 验证了生成数据在六种焦虑症上的有效性,并分析了隐私性

方法论

利用检索增强生成技术(RAG),结合DSM-5和ICD-10知识库,引导LLM生成合成精神科数据,并与CTGAN和TVAE进行对比。

原文摘要

AI systems in healthcare research have shown potential to increase patient throughput and assist clinicians, yet progress is constrained by limited access to real patient data. To address this issue, we present a zero-shot, knowledge-guided framework for psychiatric tabular data in which large language models (LLMs) are steered via Retrieval-Augmented Generation using the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) and the International Classification of Diseases (ICD-10). We conducted experiments using different combinations of knowledge bases to generate privacy-preserving synthetic data. The resulting models were benchmarked against two state-of-the-art deep learning models for synthetic tabular data generation, namely CTGAN and TVAE, both of which rely on real data and therefore entail potential privacy risks. Evaluation was performed on six anxiety-related disorders: specific phobia, social anxiety disorder, agoraphobia, generalized anxiety disorder, separation anxiety disorder, and panic disorder. CTGAN typically achieves the best marginals and multivariate structure, while the knowledge-augmented LLM is competitive on pairwise structure and attains the lowest pairwise error in separation anxiety and social anxiety. An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM. Privacy analyses indicate that the real data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score. Overall, grounding an LLM in clinical knowledge enables high-quality, privacy-preserving synthetic psychiatric data when real datasets are unavailable or cannot be shared.

标签

RAG LLM 合成数据 隐私保护 精神科数据

arXiv 分类

cs.LG