🤖 AI Summary
Existing Social and Behavioral Determinants of Health (SBDH) datasets for clinical text are scarce, narrowly scoped, and prohibitively expensive to annotate, which severely hinders automated SBDH identification in real-world clinical NLP. Method: We introduce Synth-SBDH, a novel synthetic, clinically oriented SBDH dataset covering 15 SBDH categories, with fine-grained annotations along three dimensions: status, temporal information, and rationale. The framework leverages large language models (LLMs) with controllable generation and post-generation verification, combining structured prompt engineering with modeling of real clinical text distributions. Contribution/Results: Synth-SBDH generalizes well under few-shot, long-tail, and cross-institutional settings. On three real clinical NLP tasks, it improves macro-F1 by up to 63.75%. Human evaluation shows 71.06% alignment between LLM-generated and expert annotations. Crucially, annotation cost is substantially lower than manual curation, enabling scalable, high-quality SBDH data construction for clinical AI.
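To make the "controllable generation with post-generation verification" idea concrete, here is a minimal sketch of a generate-then-verify loop. The prompt wording, category list, and the `generate`/`verify` helpers are assumptions for illustration, not the authors' actual pipeline.

```python
# Hypothetical sketch of a generate-then-verify loop for synthetic SBDH data.
# Prompt text, category names, and helper functions are illustrative only.
import json

SBDH_CATEGORIES = ["housing", "employment", "tobacco_use"]  # subset of the 15 categories


def build_prompt(category: str) -> str:
    # Controllable generation: the target SBDH category is specified in the prompt.
    return (
        f"Write one realistic clinical note sentence describing the SBDH category "
        f"'{category}'. Return JSON with fields: text, status, temporal, rationale."
    )


def generate(prompt: str, llm) -> dict:
    # `llm` is any callable mapping a prompt string to a model response string.
    return json.loads(llm(prompt))


def verify(record: dict) -> bool:
    # Post-generation check: require every annotation field to be present and non-empty.
    return all(record.get(k) for k in ("text", "status", "temporal", "rationale"))


def synthesize(llm, per_category: int = 2) -> list:
    dataset = []
    for category in SBDH_CATEGORIES:
        for _ in range(per_category):
            record = generate(build_prompt(category), llm)
            if verify(record):  # keep only records that pass verification
                record["category"] = category
                dataset.append(record)
    return dataset
```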
📝 Abstract
Social and behavioral determinants of health (SBDH) play a crucial role in health outcomes and are frequently documented in clinical text. Automatically extracting SBDH information from clinical text relies on publicly available, high-quality datasets. However, existing SBDH datasets exhibit substantial limitations in their availability and coverage. In this study, we introduce Synth-SBDH, a novel synthetic dataset with detailed SBDH annotations, encompassing status, temporal information, and rationale across 15 SBDH categories. We showcase the utility of Synth-SBDH on three tasks using real-world clinical datasets from two distinct hospital settings, highlighting its versatility, generalizability, and distillation capabilities. Models trained on Synth-SBDH consistently outperform counterparts trained without it, achieving macro-F improvements of up to 63.75%. Additionally, Synth-SBDH proves effective for rare SBDH categories and under resource constraints, while being substantially cheaper than expert-annotated real-world data. Human evaluation reveals a 71.06% human-LLM alignment and uncovers areas for future refinement.
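For illustration, a single Synth-SBDH record with status, temporal information, and rationale could be represented roughly as below. The field names, category labels, and example text are hypothetical, since the exact release schema is not specified in this abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SBDHAnnotation:
    """One SBDH mention: category, status, temporal info, and rationale.

    Field names are illustrative; the released dataset's schema may differ.
    """
    category: str   # one of the 15 SBDH categories, e.g. "tobacco_use"
    status: str     # e.g. "present", "absent", "unknown"
    temporal: str   # e.g. "current" or "past"
    rationale: str  # free-text evidence supporting the label


@dataclass
class SynthSBDHExample:
    text: str                         # synthetic clinical sentence or note snippet
    annotations: List[SBDHAnnotation]


# Hypothetical example record
example = SynthSBDHExample(
    text="Patient reports smoking one pack per day and is currently unemployed.",
    annotations=[
        SBDHAnnotation("tobacco_use", "present", "current",
                       "reports smoking one pack per day"),
        SBDHAnnotation("employment", "present", "current",
                       "is currently unemployed"),
    ],
)
```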