🤖 AI Summary
This work addresses the eRisk 2025 task on depression symptom identification, which aims to detect and rank user-generated sentences relevant to each of the 21 items of the Beck Depression Inventory (BDI). Methodologically, each BDI item is treated as an independent binary classification problem. We propose a symptom-adaptive framework: (i) mitigating class imbalance by augmenting the fine-tuning data of BERT-based models with LLM-generated (Llama/Phi) synthetic examples; (ii) employing symptom-specific modeling strategies; and (iii) designing five evaluation configurations, including two ensemble variants, that integrate sentence embedding matching, prompt engineering, and multi-paradigm fine-tuning. Our key contributions lie in fine-grained, symptom-level adaptive modeling and the synergistic combination of heterogeneous signals. Evaluated on the official eRisk 2025 benchmark, our system achieved first place among all participating teams, attaining top performance in both Average Precision (AP) and R-Precision (R-PREC).
📝 Abstract
In this work, we describe our team's approach to eRisk's 2025 Task 1: Search for Symptoms of Depression. Given a set of sentences and the Beck Depression Inventory-II (BDI) questionnaire, participants were tasked with submitting up to 1,000 sentences per depression symptom in the BDI, sorted by relevance. Participant submissions were evaluated according to standard Information Retrieval (IR) metrics, including Average Precision (AP) and R-Precision (R-PREC). The provided training data, however, consisted of sentences labeled only as relevant or not relevant with respect to a given BDI symptom. Due to this labeling limitation, we framed our development as a binary classification task for each BDI symptom, and evaluated accordingly. To that end, we split the available labeled data into training and validation sets, and explored foundation model fine-tuning, sentence similarity, Large Language Model (LLM) prompting, and ensemble techniques. The validation results revealed that fine-tuning foundation models yielded the best performance, particularly when enhanced with synthetic data to mitigate class imbalance. We also observed that the optimal approach varied by symptom. Based on these insights, we devised five independent test runs, two of which used ensemble methods. These runs achieved the highest scores in the official IR evaluation, outperforming submissions from 16 other teams.
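The task framing above, treating each BDI symptom as an independent binary classifier and then ranking sentences by relevance for the IR-style submission, can be sketched as follows. This is a minimal illustration, not the authors' system: the `score_fn` stand-ins here are hypothetical placeholders for the fine-tuned, similarity-based, or prompted models the abstract describes, and the 1,000-sentence cap comes from the task rules.

```python
# Hypothetical sketch: one independent scorer per BDI symptom, with the
# per-symptom ranking truncated to the task's submission limit.

MAX_PER_SYMPTOM = 1000  # up to 1,000 sentences may be submitted per symptom


def rank_sentences(sentences, score_fn, limit=MAX_PER_SYMPTOM):
    """Sort sentences by descending relevance score and truncate to the cap."""
    return sorted(sentences, key=score_fn, reverse=True)[:limit]


def build_run(sentences, symptom_scorers):
    """Produce one ranked list per BDI symptom (21 items in the full BDI)."""
    return {
        symptom: rank_sentences(sentences, scorer)
        for symptom, scorer in symptom_scorers.items()
    }


if __name__ == "__main__":
    # Toy keyword-overlap scorers standing in for fine-tuned classifiers.
    sentences = ["I feel sad all the time", "I slept well", "I cry often"]
    scorers = {
        "sadness": lambda s: ("sad" in s) + ("cry" in s),
        "sleep": lambda s: float("slept" in s),
    }
    run = build_run(sentences, scorers)
    print(run["sleep"][0])  # prints "I slept well"
```

In the real system, `score_fn` for a given symptom would be whichever approach validated best for that symptom (fine-tuned model, embedding similarity, LLM prompting, or an ensemble), reflecting the per-symptom adaptivity the abstract highlights.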