🤖 AI Summary
Clinical dialogue data are difficult to access because of privacy and governance constraints, while existing synthetic data lack theoretical grounding, which limits their generalizability in medical NLP tasks. To address this, we propose the first taxonomy framework designed specifically for synthetic clinical dialogue data. It is structured along multiple dimensions, including synthesis granularity, knowledge injection mechanisms, and evaluation paradigms, and establishes a hierarchical typology. The framework formally defines types and degrees of data synthesis, enabling cross-dataset comparability and alignment between methods and tasks, and thereby fills a theoretical gap in clinical dialogue synthesis. We further provide reusable classification criteria and an evaluation guideline to support controllable data quality and predictable task generalization. Empirical validation across multiple clinical NLP dialogue tasks confirms the framework’s effectiveness in guiding high-quality, task-adaptive synthetic data generation.
📝 Abstract
Synthetic datasets are used across linguistic domains and NLP tasks, particularly in scenarios where authentic data is limited (or even non-existent). One such domain is clinical (healthcare) contexts, where significant and long-standing challenges (e.g., privacy, anonymization, and data governance) have led to the development of an increasing number of synthetic datasets. One increasingly important category of clinical dataset is clinical dialogues, which are especially sensitive and difficult to collect and as such are commonly synthesized. While such synthetic datasets have been shown to be sufficient in some situations, little theory exists to inform how they may best be used and generalized to new applications. In this paper, we provide an overview of how synthetic datasets are created, evaluated, and used for dialogue-related tasks in the medical domain. Additionally, we propose a novel typology for classifying types and degrees of data synthesis, to facilitate comparison and evaluation.