Evaluating Large Language Models for Anxiety, Depression, and Stress Detection: Insights into Prompting Strategies and Synthetic Data

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Mental health text analysis faces challenges including difficulty in distinguishing anxiety, depression, and stress; severe class imbalance; and implicit, context-dependent expressions.

Method: This study systematically benchmarks large language models (GPT, Llama) against mainstream pretrained models (BERT, XLNet, Distil-RoBERTa) on the DAIC-WOZ clinical interview dataset, and proposes a zero-shot generative data augmentation method to synthesize high-fidelity, label-balanced samples.

Contribution/Results: The synthetic data significantly alleviates label sparsity, improves minority-class recall, and enhances cross-task generalization. Experiments show Distil-RoBERTa achieves an F1-score of 0.883 on the GAD-2 task, XLNet attains 0.891 on the PHQ task, and synthetic data elevates stress detection performance to F1 = 0.884 and ROC AUC = 0.886. The work underscores the critical role of prompt engineering and controllable synthetic data construction in fine-grained mental state identification, offering a reproducible, low-resource NLP framework for mental health applications.
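The zero-shot, label-balanced augmentation idea above can be sketched as follows. This is a minimal illustration, assuming synthesis budgets are chosen to equalize class sizes; the function names and prompt wording are hypothetical, not taken from the paper:

```python
from collections import Counter

def synthesis_targets(labels):
    """Return how many synthetic samples each class needs
    so that every class matches the majority-class size."""
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}

def build_prompt(cls, n):
    """Hypothetical zero-shot prompt asking an LLM to synthesize
    n interview-style responses matching the given label."""
    return (
        f"Generate {n} short, realistic clinical-interview responses "
        f"from a participant whose text should be labeled '{cls}'. "
        "Vary wording and keep symptom expression implicit."
    )

# Example: a stress-detection set with 90 negatives and 10 positives
targets = synthesis_targets(["no-stress"] * 90 + ["stress"] * 10)
# → {'no-stress': 0, 'stress': 80}
prompt = build_prompt("stress", targets["stress"])
```

The balancing step is what the summary calls alleviating label sparsity: only minority classes receive a nonzero synthesis budget, so the augmented set is label-balanced by construction.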

📝 Abstract
Mental health disorders affect over one-fifth of adults globally, yet detecting such conditions from text remains challenging due to the subtle and varied nature of symptom expression. This study evaluates multiple approaches for mental health detection, comparing Large Language Models (LLMs) such as Llama and GPT with classical machine learning and transformer-based architectures including BERT, XLNet, and Distil-RoBERTa. Using the DAIC-WOZ dataset of clinical interviews, we fine-tuned models for anxiety, depression, and stress classification and applied synthetic data generation to mitigate class imbalance. Results show that Distil-RoBERTa achieved the highest F1 score (0.883) for GAD-2, while XLNet outperformed others on PHQ tasks (F1 up to 0.891). For stress detection, a zero-shot synthetic approach (SD+Zero-Shot-Basic) reached an F1 of 0.884 and ROC AUC of 0.886. Findings demonstrate the effectiveness of transformer-based models and highlight the value of synthetic data in improving recall and generalization. However, careful calibration is required to prevent precision loss. Overall, this work emphasizes the potential of combining advanced language models and data augmentation to enhance automated mental health assessment from text.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for detecting anxiety, depression, and stress
Comparing transformer models with classical machine learning
Using synthetic data to address class imbalance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning transformer models for mental health classification
Applying synthetic data generation to address class imbalance
Combining language models with data augmentation techniques
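For reference, the F1 scores reported throughout (e.g., 0.883 for GAD-2) are the harmonic mean of precision and recall. A minimal sketch of the computation from a binary confusion matrix; the counts below are illustrative, not the paper's:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall,
    computed from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 15 true positives, 3 false positives, 1 false negative
print(round(f1_score(15, 3, 1), 3))  # → 0.882
```

Because F1 penalizes imbalance between precision and recall, it is a natural headline metric for the minority-class recall gains the paper attributes to synthetic data, alongside ROC AUC for threshold-free comparison.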