🤖 AI Summary
Clinical documentation burden contributes to physician burnout. To address the scarcity of high-quality, privacy-compliant paired data for medical natural language processing, we introduce the first large-scale, open-source, HIPAA-aligned dataset of paired dialogues and clinical notes (>10,000 aligned pairs, usable in both directions), covering 2,000+ ICD-10 codes. Rather than sampling conditions at random, our generation framework is grounded in real-world disease prevalence distributions and combines large language model–based synthesis with clinical expert validation, with the aim of factual accuracy, lexical diversity, and epidemiological representativeness. Empirical evaluation shows substantial performance gains on both dialogue-to-note (Dial-2-Note) and note-to-dialogue (Note-2-Dial) generation across multiple architectures. The dataset provides a scalable, privacy-preserving benchmark for medical NLP and helps fill a gap in publicly available, clinically validated paired resources.
📝 Abstract
Physicians spend significant time documenting clinical encounters, a burden that contributes to professional burnout. To address this, robust automation tools for medical documentation are crucial. We introduce MedSynth, a novel dataset of synthetic medical dialogues and notes designed to advance the Dialogue-to-Note (Dial-2-Note) and Note-to-Dialogue (Note-2-Dial) tasks. Informed by an extensive analysis of disease distributions, the dataset includes over 10,000 dialogue-note pairs covering over 2,000 ICD-10 codes. We demonstrate that it markedly improves model performance both in generating medical notes from dialogues and in generating dialogues from medical notes. The dataset is a valuable resource in a field where open-access, privacy-compliant, and diverse training data are scarce. Code is available at https://github.com/ahmadrezarm/MedSynth/tree/main and the dataset is available at https://huggingface.co/datasets/Ahmad0067/MedSynth.
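The abstract does not spell out how disease distributions inform the dataset. One plausible reading is that the ICD-10 codes to generate pairs for are drawn in proportion to how common each condition is. The sketch below illustrates only that idea and is not the authors' pipeline: the `sample_codes` helper is hypothetical, and the weights in `TOY_PREVALENCE` are made-up illustrative values, not real prevalence figures.

```python
import random
from collections import Counter

# Made-up weights for illustration only; NOT real prevalence data.
TOY_PREVALENCE = {
    "I10": 50,      # essential (primary) hypertension
    "E11.9": 30,    # type 2 diabetes mellitus without complications
    "J45.909": 15,  # unspecified asthma, uncomplicated
    "G43.909": 5,   # migraine, unspecified
}


def sample_codes(prevalence, n, seed=0):
    """Draw n ICD-10 codes with probability proportional to their weight.

    Hypothetical helper: a seeded, weighted draw with replacement.
    """
    rng = random.Random(seed)
    codes = list(prevalence)
    weights = [prevalence[code] for code in codes]
    return rng.choices(codes, weights=weights, k=n)


if __name__ == "__main__":
    draws = sample_codes(TOY_PREVALENCE, 10_000)
    # Print how many dialogue-note pairs each code would receive.
    for code, count in Counter(draws).most_common():
        print(code, count)
```

Under this assumption, common conditions receive more dialogue-note pairs than rare ones, so the synthetic corpus mirrors the case mix a model would meet in practice instead of treating every code as equally likely.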