Dynamic Summary Generation for Interpretable Multimodal Depression Detection

📅 2026-04-13

🤖 AI Summary

Depression is frequently underdiagnosed because of stigma and reliance on subjective assessments. This work proposes a coarse-to-fine, multi-stage multimodal framework that, for the first time, integrates progressively richer clinical summaries, generated dynamically by a large language model, into the depression detection pipeline. Drawing jointly on textual, acoustic, and visual features, the approach performs binary classification, five-level severity grading, and continuous-value regression, and consolidates the stage-wise summaries into a human-readable assessment report. On the E-DAIC and CMDC datasets, the method shows significant improvements over state-of-the-art baselines in both accuracy and interpretability.

📝 Abstract

Depression remains widely underdiagnosed and undertreated because stigma and subjective symptom ratings hinder reliable screening. To address this challenge, we propose a coarse-to-fine, multi-stage framework that leverages large language models (LLMs) for accurate and interpretable detection. The pipeline performs binary screening, five-class severity classification, and continuous regression. At each stage, an LLM produces progressively richer clinical summaries that guide a multimodal fusion module integrating text, audio, and video features, yielding predictions with transparent rationale. The system then consolidates all summaries into a concise, human-readable assessment report. Experiments on the E-DAIC and CMDC datasets show significant improvements over state-of-the-art baselines in both accuracy and interpretability.
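
The staged flow described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `assess`, `Assessment`, `fake_summarize`, and `fake_predictor` are hypothetical, the LLM and the multimodal fusion module are replaced by stub callables, and running the three stages strictly in sequence, with each summary built on the earlier ones, is one plausible reading of "coarse-to-fine".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type aliases (not from the paper).
Features = Dict[str, List[float]]              # keys: "text", "audio", "video"
Summarizer = Callable[[str, List[str]], str]   # (stage, earlier summaries) -> summary
Predictor = Callable[[Features, str], float]   # (features, summary) -> raw prediction


@dataclass
class Assessment:
    is_depressed: bool = False
    severity_class: int = 0   # five levels, encoded here as 0..4 (an assumption)
    score: float = 0.0        # continuous severity estimate
    summaries: List[str] = field(default_factory=list)

    def report(self) -> str:
        """Consolidate the per-stage summaries into one readable report."""
        lines = [
            f"Screening: {'positive' if self.is_depressed else 'negative'}",
            f"Severity class: {self.severity_class}",
            f"Estimated score: {self.score:.1f}",
        ]
        lines += [f"- {s}" for s in self.summaries]
        return "\n".join(lines)


def assess(features: Features, summarize: Summarizer,
           predictors: Dict[str, Predictor]) -> Assessment:
    """Run the three stages in order; each stage sees a summary built on earlier ones."""
    result = Assessment()
    for stage in ("binary", "severity", "regression"):
        summary = summarize(stage, result.summaries)
        result.summaries.append(summary)
        raw = predictors[stage](features, summary)
        if stage == "binary":
            result.is_depressed = raw >= 0.5          # assumed decision threshold
        elif stage == "severity":
            result.severity_class = max(0, min(4, int(round(raw))))
        else:
            result.score = raw
    return result


# Stubs standing in for the LLM and the fusion module (assumptions for the demo).
def fake_summarize(stage: str, earlier: List[str]) -> str:
    return f"{stage} summary built on {len(earlier)} earlier summaries"


def fake_predictor(value: float) -> Predictor:
    return lambda features, summary: value


features = {"text": [0.1], "audio": [0.2], "video": [0.3]}
result = assess(
    features,
    fake_summarize,
    {
        "binary": fake_predictor(0.8),
        "severity": fake_predictor(2.0),
        "regression": fake_predictor(12.5),
    },
)
print(result.report())
```

In a real system the stub callables would be replaced by an LLM call and a trained fusion model. The sketch only shows how stage-wise summaries might accumulate and feed the final report.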

Problem

Research questions and friction points this paper is trying to address.

depression detection

underdiagnosis

subjective symptom ratings

stigma

interpretable screening

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal fusion

large language models

interpretable AI