Thousand Voices of Trauma: A Large-Scale Synthetic Dataset for Modeling Prolonged Exposure Therapy Conversations

2025-04-16
AI Summary
AI development for trauma therapy is hindered by a scarcity of high-quality, clinically grounded dialogue data. Method: We introduce the first large-scale synthetic dialogue dataset (3,000 dialogues across 500 cases and six treatment phases) tailored to prolonged exposure therapy for PTSD. Our novel clinical-protocol-driven, multi-perspective emotional evolution synthesis framework integrates demographic attributes, 20 trauma types, and 10 trauma-related behavioral patterns. It jointly employs deterministic rules and probabilistic generation, embedding clinical guidelines, symptom comorbidity patterns, and empirically calibrated real-world prevalence sampling (e.g., witnessing violence: 10.6%; nightmares: 23.4%). Contribution/Results: We release an expert-validated emotional trajectory evaluation benchmark, ensuring privacy-preserving yet clinically faithful dialogues. Clinical experts confirm the dataset's emotional depth and therapeutic fidelity, enabling quantifiable assessment of trauma-informed dialogue understanding and response generation.

πŸ“ Abstract
The advancement of AI systems for mental health support is hindered by limited access to therapeutic conversation data, particularly for trauma treatment. We present Thousand Voices of Trauma, a synthetic benchmark dataset of 3,000 therapy conversations based on Prolonged Exposure therapy protocols for Post-traumatic Stress Disorder (PTSD). The dataset comprises 500 unique cases, each explored through six conversational perspectives that mirror the progression of therapy from initial anxiety to peak distress to emotional processing. We incorporated diverse demographic profiles (ages 18-80, M=49.3, 49.4% male, 44.4% female, 6.2% non-binary), 20 trauma types, and 10 trauma-related behaviors using deterministic and probabilistic generation methods. Analysis reveals realistic distributions of trauma types (witnessing violence 10.6%, bullying 10.2%) and symptoms (nightmares 23.4%, substance abuse 20.8%). Clinical experts validated the dataset's therapeutic fidelity, highlighting its emotional depth while suggesting refinements for greater authenticity. We also developed an emotional trajectory benchmark with standardized metrics for evaluating model responses. This privacy-preserving dataset addresses critical gaps in trauma-focused mental health data, offering a valuable resource for advancing both patient-facing applications and clinician training tools.
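
The combination of deterministic structure and probabilistic sampling described above can be illustrated with a minimal sketch. This is not the authors' generation pipeline: the class and function names are hypothetical, only the four prevalence figures quoted in the abstract are taken from the source, the remaining trauma types are lumped into a placeholder "other" category, ages are drawn uniformly from the stated 18-80 range, and behaviors are drawn independently per case, which is an assumption.

```python
import random
from dataclasses import dataclass

# Prevalences quoted in the abstract. The full dataset has 20 trauma types
# and 10 behaviors; the rest are not listed in the source, so they are
# collapsed into a placeholder "other" category here (assumption).
TRAUMA_PREVALENCE = {"witnessing violence": 0.106, "bullying": 0.102}
BEHAVIOR_PREVALENCE = {"nightmares": 0.234, "substance abuse": 0.208}
N_PERSPECTIVES = 6  # six conversational perspectives per case


@dataclass(frozen=True)
class CaseProfile:
    """Hypothetical container for one synthetic case."""
    case_id: int
    age: int
    trauma_type: str
    behaviors: tuple


def sample_trauma_type(rng):
    # Probabilistic step: one trauma type per case, weighted by prevalence.
    names = list(TRAUMA_PREVALENCE) + ["other"]
    weights = list(TRAUMA_PREVALENCE.values())
    weights.append(1.0 - sum(weights))  # leftover mass goes to "other"
    return rng.choices(names, weights=weights, k=1)[0]


def sample_behaviors(rng):
    # Probabilistic step: each behavior is an independent draw (assumption).
    return tuple(name for name, p in BEHAVIOR_PREVALENCE.items()
                 if rng.random() < p)


def build_cases(n_cases=500, seed=0):
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [
        CaseProfile(i, rng.randint(18, 80), sample_trauma_type(rng),
                    sample_behaviors(rng))
        for i in range(n_cases)
    ]


def conversation_slots(cases):
    # Deterministic step: every case gets exactly one slot per perspective.
    return [(case.case_id, perspective)
            for case in cases
            for perspective in range(1, N_PERSPECTIVES + 1)]


cases = build_cases()
slots = conversation_slots(cases)
print(len(cases), len(slots))  # 500 3000
```

The slot count follows directly from the dataset's stated structure: 500 cases times six perspectives gives the 3,000 conversations reported in the abstract.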
Problem

Research questions and friction points this paper is trying to address.

Limited access to therapeutic conversation data for trauma treatment
Need for diverse synthetic dataset modeling PTSD therapy progression
Lack of standardized metrics for evaluating AI therapy responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic dataset of 3,000 PTSD therapy conversations
Diverse demographics and trauma types incorporated
Privacy-preserving emotional trajectory benchmark developed
Authors

BN Suhas, College of Information Sciences and Technology, Penn State University, USA
Dominik Mattioli, College of Information Sciences and Technology, Penn State University, USA
Saeed Abdullah, Penn State (HCI, Digital Health, mHealth, HCAI)
Rosa I. Arriaga, Associate Professor, Interactive Computing, Georgia Tech (HCI, mHealth, social computing, cognitive science)
Chris W. Wiese, School of Psychology, Georgia Tech, USA
Andrew M. Sherrill, Department of Psychiatry and Behavioral Sciences, Emory University, USA