🤖 AI Summary
This study systematically evaluates the clinical fidelity of synthetic PTSD Prolonged Exposure (PE) therapy dialogues to determine whether they can substitute for real clinical data in model training and evaluation. Method: We propose the first PE-specific fidelity assessment framework, integrating linguistic profiling, dialogue structure modeling, protocol compliance checking, and semantic similarity analysis, and introducing protocol-sensitive linguistic and semantic metrics that go beyond conventional fluency-oriented evaluation. Contribution/Results: Experiments show that synthetic dialogues replicate basic structural properties (e.g., turn-taking ratio: 0.98 vs. 0.99 in real data) but fall short on core clinical dimensions, including dynamic distress monitoring and phase-appropriate therapeutic alignment, exposing blind spots in current synthetic-data evaluation paradigms. Our work establishes a reusable methodological benchmark and identifies concrete directions for improving clinical dialogue generation and fidelity assessment.
📝 Abstract
The growing adoption of synthetic data in healthcare is driven by privacy concerns, limited access to real-world data, and the high cost of annotation. This work explores the use of synthetic Prolonged Exposure (PE) therapeutic conversations for Post-Traumatic Stress Disorder (PTSD) as a scalable alternative for training and evaluating clinical models. We systematically compare real and synthetic dialogues using linguistic, structural, and protocol-specific metrics, including turn-taking patterns and treatment fidelity. We also introduce and evaluate PE-specific metrics derived from linguistic analysis and semantic modeling, offering a novel framework for assessing clinical fidelity beyond surface fluency. Our findings show that although synthetic data holds promise for mitigating data scarcity and protecting patient privacy, it can struggle to capture the subtle dynamics of therapeutic interactions. In our dataset, synthetic dialogues match structural features of real-world dialogues (e.g., speaker switch ratio: 0.98 vs. 0.99); however, they do not adequately reflect key fidelity markers such as distress monitoring. We highlight gaps in existing evaluation frameworks and advocate for fidelity-aware metrics that uncover clinically significant failures. Our findings clarify where synthetic data can effectively complement real-world datasets and where critical limitations remain.
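To illustrate the kind of structural metric cited above (speaker switch ratio), the sketch below computes the fraction of adjacent turn pairs where the speaker changes. This is a minimal example under assumed conventions: the function name `speaker_switch_ratio` and the `(speaker, utterance)` list format are hypothetical, and the paper's exact metric definition may differ.

```python
from typing import List, Tuple

def speaker_switch_ratio(turns: List[Tuple[str, str]]) -> float:
    """Fraction of adjacent turn pairs in which the speaker changes.

    `turns` is a list of (speaker_label, utterance) pairs, e.g.
    [("therapist", "..."), ("patient", "..."), ...]. A ratio near 1.0
    means the speakers alternate on almost every turn.
    """
    if len(turns) < 2:
        return 0.0
    switches = sum(
        1 for (prev_spk, _), (curr_spk, _) in zip(turns, turns[1:])
        if prev_spk != curr_spk
    )
    return switches / (len(turns) - 1)

# Usage: a perfectly alternating therapist/patient exchange yields 1.0
dialogue = [
    ("therapist", "How has your week been?"),
    ("patient", "Difficult. The nightmares came back."),
    ("therapist", "Let's rate your distress on a 0-100 scale."),
    ("patient", "Around 70 right now."),
]
print(speaker_switch_ratio(dialogue))  # -> 1.0
```

A high switch ratio alone only captures alternation of speakers; as the abstract notes, it says nothing about clinically important behaviors such as in-session distress monitoring, which is why protocol-specific fidelity metrics are needed alongside structural ones.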