🤖 AI Summary
This study addresses key challenges in extracting suicide-related social determinants of health (SDoH) from unstructured text: low extraction accuracy, a severely long-tailed factor distribution, difficulty identifying the critical stressors that precede an incident, and poor model interpretability. To tackle these, we propose a multi-stage large language model (LLM) framework that integrates fine-grained contextual retrieval, explicit modeling of intermediate reasoning chains, and task-oriented fine-tuning, balancing precision and transparency. The framework leverages BioBERT and DeepSeek-R1, augmented by prompt engineering and knowledge distillation into lightweight models to improve the trade-off between performance and efficiency. Experiments show improvements over baselines in both SDoH extraction and contextual recall. A pilot user evaluation indicates that the interpretable outputs improve annotation efficiency by 23.6% and accuracy by 18.4%, providing more reliable data support for early identification of, and intervention for, high-risk individuals.
📝 Abstract
Background: Understanding the social determinants of health (SDoH) factors that contribute to suicide incidents is crucial for early intervention and prevention. However, data-driven approaches to this goal face several challenges: long-tailed factor distributions, the difficulty of analyzing pivotal stressors that precede suicide incidents, and limited model explainability. Methods: We present a multi-stage large language model framework to enhance SDoH factor extraction from unstructured text. We compared our approach with other state-of-the-art language models (i.e., pre-trained BioBERT and GPT-3.5-turbo) and reasoning models (i.e., DeepSeek-R1). We also evaluated whether the model's explanations help people annotate SDoH factors more quickly and accurately. The analysis included both automated comparisons and a pilot user study. Results: The proposed framework improved performance both on the overarching task of extracting SDoH factors and on the finer-grained task of retrieving relevant context. Fine-tuning a smaller, task-specific model achieved comparable or better performance at reduced inference cost. The multi-stage design not only enhances extraction but also produces intermediate explanations, which improves model explainability. Conclusions: Our approach improves both the accuracy and the transparency of extracting suicide-related SDoH from unstructured text. These advances have the potential to support early identification of individuals at risk and to inform more effective prevention strategies.