Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation

📅 2026-03-26
🤖 AI Summary
This study addresses a persistent obstacle in psychiatric AI research: real patient data is difficult to access and carries privacy risks. The authors propose a clinical knowledge-guided retrieval-augmented generation (RAG) framework that uses the DSM-5 and ICD-10 diagnostic criteria to steer large language models in generating synthetic tabular data for psychiatric disorders without requiring any real patient records. Evaluated on six anxiety disorder categories, the method is competitive with the real-data baselines CTGAN and TVAE on pairwise structure and attains the lowest pairwise error for separation anxiety and social anxiety, although CTGAN typically achieves the best marginals and multivariate structure. Because the approach is zero-shot and uses no real data, its privacy analysis shows modest overlaps and a low average linkage risk comparable to CTGAN.
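The retrieve-then-prompt idea in this summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the knowledge-base texts are invented placeholders rather than DSM-5 or ICD-10 wording, `retrieve` uses simple word-overlap scoring in place of whatever retriever the paper uses, and `build_prompt` and the column names are hypothetical.

```python
import re

# Placeholder knowledge base: invented paraphrases, NOT quotations
# from DSM-5 or ICD-10. A real system would index the actual manuals.
KNOWLEDGE_BASE = [
    {"source": "DSM-5 (placeholder)",
     "text": "social anxiety disorder involves marked fear of social situations"},
    {"source": "ICD-10 (placeholder)",
     "text": "panic disorder involves recurrent unexpected panic attacks"},
    {"source": "DSM-5 (placeholder)",
     "text": "separation anxiety disorder involves fear of separation from attachment figures"},
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, kb, top_k=2):
    """Rank passages by word overlap with the query (assumed scoring rule)."""
    q = tokenize(query)
    ranked = sorted(kb, key=lambda doc: len(q & tokenize(doc["text"])), reverse=True)
    return ranked[:top_k]


def build_prompt(disorder, columns, kb, n_rows=5):
    """Assemble a zero-shot prompt grounded in the retrieved passages."""
    passages = retrieve(disorder, kb)
    context = "\n".join(f"- [{p['source']}] {p['text']}" for p in passages)
    return (
        f"Clinical reference passages:\n{context}\n\n"
        f"Generate {n_rows} synthetic patient rows for {disorder} as CSV "
        f"with columns: {', '.join(columns)}. Do not copy any real patient record."
    )


# Hypothetical column names, chosen only for illustration.
prompt = build_prompt("social anxiety disorder",
                      ["age", "sex", "symptom_severity"], KNOWLEDGE_BASE)
print(prompt)
```

The resulting string would be sent to an LLM of choice; no real patient record enters the prompt, which is the property the zero-shot design relies on.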

📝 Abstract
AI systems in healthcare research have shown potential to increase patient throughput and assist clinicians, yet progress is constrained by limited access to real patient data. To address this issue, we present a zero-shot, knowledge-guided framework for psychiatric tabular data in which large language models (LLMs) are steered via Retrieval-Augmented Generation using the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) and the International Classification of Diseases (ICD-10). We conducted experiments using different combinations of knowledge bases to generate privacy-preserving synthetic data. The resulting models were benchmarked against two state-of-the-art deep learning models for synthetic tabular data generation, namely CTGAN and TVAE, both of which rely on real data and therefore entail potential privacy risks. Evaluation was performed on six anxiety-related disorders: specific phobia, social anxiety disorder, agoraphobia, generalized anxiety disorder, separation anxiety disorder, and panic disorder. CTGAN typically achieves the best marginals and multivariate structure, while the knowledge-augmented LLM is competitive on pairwise structure and attains the lowest pairwise error in separation anxiety and social anxiety. An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM. Privacy analyses indicate that the real data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score. Overall, grounding an LLM in clinical knowledge enables high-quality, privacy-preserving synthetic psychiatric data when real datasets are unavailable or cannot be shared.
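To make the evaluation terms concrete, the sketch below shows one plausible way to compute a pairwise structure error and a duplication rate. Both definitions are assumptions: this page does not specify the paper's exact metrics, so `pairwise_correlation_error` (mean absolute difference between Pearson correlations over all column pairs) and `exact_overlap_rate` (share of synthetic rows identical to some real row) are illustrative stand-ins, and the k-map score is not covered.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if a column is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def pairwise_correlation_error(real, synth):
    """Mean absolute gap between real and synthetic correlations over all column pairs.

    `real` and `synth` map column name -> list of numbers and share the same columns.
    Assumed metric, not necessarily the one used in the paper.
    """
    cols = sorted(real)
    gaps = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            gaps.append(abs(pearson(real[a], real[b]) - pearson(synth[a], synth[b])))
    return sum(gaps) / len(gaps)


def exact_overlap_rate(real_rows, synth_rows):
    """Fraction of synthetic rows that exactly duplicate a real row."""
    real_set = {tuple(r) for r in real_rows}
    return sum(tuple(r) in real_set for r in synth_rows) / len(synth_rows)


# Toy data: the synthetic table reverses the relationship between the two columns.
real = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8]}
synth = {"a": [1, 2, 3, 4], "b": [8, 6, 4, 2]}
error = pairwise_correlation_error(real, synth)
overlap = exact_overlap_rate([(1, 0), (0, 1)], [(1, 0), (1, 1)])
```

A lower pairwise error means the synthetic table reproduces the relationships between variables more closely, while a lower overlap rate means fewer real records are reproduced verbatim.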
Problem

Research questions and friction points this paper is trying to address.

synthetic data generation
privacy preservation
psychiatric data
zero-shot learning
healthcare AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
Zero-Shot Synthetic Data
Privacy-Preserving AI
Knowledge-Guided LLM
Psychiatric Tabular Data
Authors
Adam Jakobsen
SimulaMet, Norway
Sushant Gautam
SimulaMet, Norway
Hugo Lewi Hammer
SimulaMet, Norway
Susanne Olofsdotter
Uppsala University, Sweden
Miriam S Johanson
Oslo Metropolitan University, Norway
Pål Halvorsen
SimulaMet, Simula Research Laboratory, Oslo Metropolitan University (OsloMet), University of Oslo
Multimedia systems, Medical Multimedia Systems, Sport Systems, Applied Machine Learning
Vajira Thambawita
SimulaMet
GPGPU Parallel Computing, Embedded Systems, Machine Learning, Deep Learning