SynSym: A Synthetic Data Generation Framework for Psychiatric Symptom Identification

📅 2026-03-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the scarcity of large-scale, symptom-level annotated data in mental health research. The scarcity stems from the high cost of expert annotation and the lack of standardized annotation guidelines, and it limits how well models generalize across diverse symptom expressions. To overcome this, we propose SynSym, a framework that uses large language models to generate diverse synthetic symptom data by combining fine-grained symptom sub-concepts, multi-style linguistic expressions, and clinical co-occurrence patterns. Across three depressive symptom benchmarks, models trained solely on SynSym-generated data perform comparably to models trained on real data, and additional fine-tuning with real data yields further gains. These results suggest that SynSym can improve model generalization while reducing reliance on costly human annotation.

📝 Abstract

Psychiatric symptom identification on social media aims to infer fine-grained mental health symptoms from user-generated posts, allowing a detailed understanding of users' mental states. However, the construction of large-scale symptom-level datasets remains challenging due to the resource-intensive nature of expert labeling and the lack of standardized annotation guidelines, which in turn limits the generalizability of models to identify diverse symptom expressions from user-generated text. To address these issues, we propose SynSym, a synthetic data generation framework for constructing generalizable datasets for symptom identification. Leveraging large language models (LLMs), SynSym constructs high-quality training samples by (1) expanding each symptom into sub-concepts to enhance the diversity of generated expressions, (2) producing synthetic expressions that reflect psychiatric symptoms in diverse linguistic styles, and (3) composing realistic multi-symptom expressions, informed by clinical co-occurrence patterns. We validate SynSym on three benchmark datasets covering different styles of depressive symptom expression. Experimental results demonstrate that models trained solely on the synthetic data generated by SynSym perform comparably to those trained on real data, and benefit further from additional fine-tuning with real data. These findings underscore the potential of synthetic data as an alternative resource to real-world annotations in psychiatric symptom modeling, and SynSym serves as a practical framework for generating clinically relevant and realistic symptom expressions.
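The three steps named in the abstract can be sketched as a small pipeline. This is a minimal illustration only, not the authors' implementation: the function names, the prompt wording, the `LLM` callable type, the style labels, and the co-occurrence weights in `TOY_COOC` are all assumptions made up for this example, and the stub `echo_llm` stands in for a real model call.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical type: any function that maps a prompt string to generated text.
LLM = Callable[[str], str]


def expand_subconcepts(symptom: str, llm: LLM, n: int = 3) -> List[str]:
    """Step 1 (assumed prompt): ask for up to n sub-concepts, one per line."""
    reply = llm(f"List {n} distinct sub-concepts of the symptom '{symptom}', one per line.")
    lines = [line.strip() for line in reply.splitlines() if line.strip()]
    # Fall back to the symptom itself if the model returns nothing usable.
    return lines[:n] or [symptom]


def generate_expression(subconcepts: List[str], style: str, llm: LLM) -> str:
    """Step 2 (assumed prompt): write one post in the requested style."""
    joined = "; ".join(subconcepts)
    return llm(f"Write a short {style} post expressing: {joined}").strip()


def sample_symptom_set(
    cooc: Dict[str, Dict[str, float]], rng: random.Random, size: int = 2
) -> List[str]:
    """Step 3: pick a seed symptom, then add partners weighted by co-occurrence."""
    seed_symptom = rng.choice(sorted(cooc))
    chosen = [seed_symptom]
    while len(chosen) < size:
        candidates = {s: w for s, w in cooc[seed_symptom].items() if s not in chosen and w > 0}
        if not candidates:
            break
        names = list(candidates)
        weights = [candidates[s] for s in names]
        chosen.append(rng.choices(names, weights=weights, k=1)[0])
    return chosen


def build_dataset(
    cooc: Dict[str, Dict[str, float]],
    styles: List[str],
    llm: LLM,
    n_samples: int,
    seed: int = 0,
) -> List[Tuple[str, List[str]]]:
    """Combine the three steps into (text, symptom labels) training pairs."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_samples):
        symptoms = sample_symptom_set(cooc, rng)
        subconcepts = [rng.choice(expand_subconcepts(s, llm)) for s in symptoms]
        text = generate_expression(subconcepts, rng.choice(styles), llm)
        dataset.append((text, symptoms))
    return dataset


def echo_llm(prompt: str) -> str:
    """Stub model for illustration: returns the prompt unchanged."""
    return prompt


# Illustrative weights only; these are not taken from the paper or clinical data.
TOY_COOC = {
    "depressed mood": {"insomnia": 0.6, "fatigue": 0.4},
    "insomnia": {"depressed mood": 0.6, "fatigue": 0.5},
    "fatigue": {"depressed mood": 0.4, "insomnia": 0.5},
}

demo = build_dataset(TOY_COOC, ["casual", "clinical"], echo_llm, n_samples=5)
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client; swapping `echo_llm` for a real call would not change the pipeline structure. Each item in `demo` pairs a generated text with its multi-label symptom set, which is the form a symptom classifier would train on.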

Problem

Research questions and friction points this paper is trying to address.

psychiatric symptom identification

synthetic data generation

mental health

social media

dataset construction

Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic data generation

psychiatric symptom identification

large language models