A Scoping Review of Synthetic Data Generation for Biomedical Research and Applications

πŸ“… 2025-06-19
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Biomedical data scarcity, privacy sensitivity, and quality heterogeneity necessitate high-fidelity synthetic data generation. This study systematically reviews 59 works (2020–2025) addressing synthetic data generation for text, tabular, and multimodal biomedical data, clinical applications, and evaluation practices. Conducted per PRISMA-ScR guidelines across PubMed, ACM, IEEE Xplore, and SpringerLink, it is the first panoramic scoping review in this domain. Methodologically, it identifies a paradigm shift toward prompt engineering (72.9% of studies) and human-in-the-loop evaluation (55.9%). Contributions include: (1) a unified framework integrating LLM-based prompting, fine-tuning, and domain-specific architectures; and (2) a heterogeneous validation suite combining intrinsic metrics, expert human assessment, and LLM-assisted evaluation. Empirical findings reveal three key patterns: modality distribution (78.0% text-centric), method composition trends, and evaluation preferences. Critical bottlenecks are identified as clinical deployability, computational resource accessibility, and lack of standardized evaluation protocols.

πŸ“ Abstract
Synthetic data generation--which mitigates data scarcity, privacy concerns, and data quality challenges in biomedical fields--has been facilitated by rapid advances in large language models (LLMs). This scoping review follows PRISMA-ScR guidelines and synthesizes 59 studies, published between 2020 and 2025 and collected from PubMed, ACM, Web of Science, and Google Scholar. The review systematically examines biomedical research and application trends in synthetic data generation, emphasizing clinical applications, methodologies, and evaluations. Our analysis identifies data modalities of unstructured text (78.0%), tabular data (13.6%), and multimodal sources (8.4%); generation methods of LLM prompting (72.9%), LLM fine-tuning (22.0%), and specialized models (5.1%); and heterogeneous evaluations of intrinsic metrics (27.1%), human-in-the-loop assessments (55.9%), and LLM-based evaluations (13.6%). The analysis addresses current limitations in what, where, and how health professionals can leverage synthetic data generation for biomedical domains. Our review also highlights challenges in adaptation across clinical domains, resource and model accessibility, and evaluation standardization.
Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity in biomedical research using synthetic data
Overcoming privacy concerns in biomedical applications with synthetic data
Evaluating synthetic data generation methods and their clinical adaptability
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs for synthetic data generation
Prompting and fine-tuning methods
Human-in-the-loop evaluation metrics
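To make the prompting approach above concrete, here is a minimal sketch of prompt-based synthetic clinical text generation. The template fields, function names, and the stub `generate_synthetic_note` are illustrative assumptions, not the protocol of any specific study in the review; a real pipeline would send the prompt to an LLM and post-process the output.

```python
# Sketch: prompt engineering for synthetic biomedical text generation.
# All names here are hypothetical, for illustration only.

PROMPT_TEMPLATE = (
    "You are a clinical documentation assistant.\n"
    "Generate a de-identified synthetic discharge summary for a patient with:\n"
    "- Condition: {condition}\n"
    "- Age group: {age_group}\n"
    "Do not include any real patient identifiers."
)

def build_prompt(condition: str, age_group: str) -> str:
    """Fill the prompt template with the desired patient attributes."""
    return PROMPT_TEMPLATE.format(condition=condition, age_group=age_group)

def generate_synthetic_note(prompt: str) -> str:
    """Placeholder for an LLM call (hosted API or local model)."""
    # A real implementation would invoke the model here, then run
    # privacy and quality checks on the generated text.
    return f"[synthetic note generated from prompt of {len(prompt)} chars]"

if __name__ == "__main__":
    prompt = build_prompt("type 2 diabetes", "60-70")
    print(generate_synthetic_note(prompt))
```

The human-in-the-loop evaluation the review highlights would then have clinicians rate such generated notes for realism and clinical validity before downstream use.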
πŸ”Ž Similar Papers
No similar papers found.