🤖 AI Summary
To address high computational costs, domain mismatch, and static knowledge limitations in multimodal depression detection—particularly those arising from conventional affective analysis—this paper proposes a Retrieval-Augmented Generation (RAG)-driven emotional prompting mechanism. The method fuses textual, audio, and visual modalities, dynamically retrieves affect-relevant knowledge from an external emotional knowledge base, and leverages large language models to generate interpretable emotional prompts, thereby enhancing cross-domain affective representation. It effectively mitigates domain shift while improving model generalizability and interpretability. Evaluated on the AVEC 2019 dataset, it achieves state-of-the-art performance (Concordance Correlation Coefficient = 0.593, Mean Absolute Error = 3.95), significantly outperforming existing transfer learning and multi-task learning approaches. The core contribution is the first application of the RAG paradigm to multimodal depression detection, enabling dynamic, interpretable, and minimally supervised affective modeling.
📝 Abstract
Multimodal deep learning has shown promise in depression detection by integrating text, audio, and video signals. Recent work leverages sentiment analysis to enhance emotional understanding, yet suffers from high computational cost, domain mismatch, and static knowledge limitations. To address these issues, we propose a novel Retrieval-Augmented Generation (RAG) framework. Given a depression-related text, our method retrieves semantically relevant emotional content from a sentiment dataset and uses a Large Language Model (LLM) to generate an Emotion Prompt as an auxiliary modality. This prompt enriches the emotional representation and improves interpretability. Experiments on the AVEC 2019 dataset show our approach achieves state-of-the-art performance, with a Concordance Correlation Coefficient (CCC) of 0.593 and a Mean Absolute Error (MAE) of 3.95, surpassing previous transfer learning and multi-task learning baselines.
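The retrieve-then-prompt step the abstract describes can be sketched as follows. This is a minimal, hypothetical illustration only: the in-memory sentiment dataset, the bag-of-words cosine similarity, and the prompt template are assumptions for demonstration, not the paper's actual components, and a real system would send the assembled Emotion Prompt to an LLM rather than print it.

```python
# Hypothetical sketch of a RAG-style emotional prompting pipeline:
# retrieve affect-relevant examples from a tiny in-memory "sentiment
# dataset" by cosine similarity over bag-of-words vectors, then
# assemble an Emotion Prompt that an LLM would complete.
# All names and data below are illustrative, not from the paper.
import math
from collections import Counter

# Stand-in for an external emotional knowledge base / sentiment dataset.
SENTIMENT_DB = [
    "I feel hopeless and cannot get out of bed",
    "Everything seems pointless lately",
    "I had a great time with my friends today",
    "I am exhausted and nothing interests me anymore",
]

def bow(text: str) -> Counter:
    """Lowercased bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2) -> list:
    """Return the k dataset entries most similar to the query."""
    q = bow(query)
    ranked = sorted(SENTIMENT_DB, key=lambda d: cosine(q, bow(d)), reverse=True)
    return ranked[:k]

def build_emotion_prompt(query: str) -> str:
    """Compose the auxiliary Emotion Prompt from retrieved evidence.
    A real system would pass this to an LLM and use the response
    as an extra input modality for the depression-detection model."""
    evidence = "\n".join(f"- {d}" for d in retrieve(query))
    return (
        f"Transcript: {query}\n"
        f"Similar emotional expressions:\n{evidence}\n"
        "Describe the speaker's emotional state in one sentence."
    )

print(build_emotion_prompt("I feel so exhausted and hopeless"))
```

In a full pipeline, the LLM's response to this prompt would be encoded and fused with the text, audio, and visual features before regression of the depression score.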