🤖 AI Summary
This study addresses the low efficiency and high cost of manually structuring the free-text narratives in suicide death reports from the U.S. National Violent Death Reporting System (NVDRS). We propose a human-in-the-loop framework that uses language models (LMs) to automatically extract 50 key variables and iteratively refines annotation guidelines from expert feedback, so that domain experts can focus on correcting errors and analyzing disagreements. Our contributions include: (1) empirical evidence that 38% of the cases in which LM predictions disagree with existing annotations surface genuine annotation discrepancies, supporting LM disagreement as a signal for error detection; and (2) an 85% agreement rate between LM predictions and existing human annotations across the 50 variables, with a newly introduced variable annotated at quality comparable to a fully manual approach. The framework improves efficiency, accuracy, and interpretability when processing sensitive textual data, offering a scalable, low-burden paradigm for structured data curation in public health.
📝 Abstract
Warning: This paper discusses topics of suicide and suicidal ideation, which may be distressing to some readers.
The National Violent Death Reporting System (NVDRS) documents information about suicides in the United States, including free-text narratives (e.g., the circumstances surrounding a suicide). In a demanding public health data pipeline, annotators manually extract structured information from death investigation records, following extensive guidelines developed painstakingly by experts. In this work, we facilitate data-driven insights from the NVDRS data to support the development of novel suicide interventions by investigating the value of language models (LMs) as efficient assistants to (a) these data annotators and (b) these experts. We find that LM predictions match existing data annotations about 85% of the time across 50 NVDRS variables. In the cases where the LM disagrees with existing annotations, expert review reveals that LM assistants can surface annotation discrepancies 38% of the time. Finally, we introduce a human-in-the-loop algorithm that assists experts in efficiently building and refining guidelines for annotating new variables by allowing them to focus only on providing feedback on incorrect LM predictions. We apply our algorithm to a real-world case study for a new variable that characterizes victim interactions with lawyers, and demonstrate that it achieves annotation quality comparable to that of a laborious manual approach. Our findings provide evidence that LMs can serve as effective assistants to public health researchers who handle sensitive data in high-stakes scenarios.
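To make the workflow concrete, below is a minimal sketch (not the authors' released code) of the human-in-the-loop guideline-refinement loop described in the abstract. It assumes two hypothetical callables: `lm_annotate`, which maps a narrative plus the current guideline text to a predicted label, and `expert_review`, which returns corrective feedback only when an expert judges the prediction incorrect.

```python
from typing import Callable, Optional


def refine_guidelines(
    narratives: list[str],
    guidelines: str,
    lm_annotate: Callable[[str, str], str],              # (narrative, guidelines) -> predicted label
    expert_review: Callable[[str, str], Optional[str]],  # (narrative, prediction) -> feedback, or None if accepted
    max_rounds: int = 5,
) -> str:
    """Iteratively refine an annotation guideline using expert feedback on LM errors."""
    for _ in range(max_rounds):
        feedback: list[str] = []
        for text in narratives:
            prediction = lm_annotate(text, guidelines)
            note = expert_review(text, prediction)  # experts only act on predictions they reject
            if note:
                feedback.append(note)
        if not feedback:  # every prediction was accepted: the guideline has converged
            break
        # Fold the corrective feedback back into the guideline text for the next round.
        guidelines += "\n" + "\n".join(f"- {item}" for item in feedback)
    return guidelines
```

The key design choice in this sketch is that experts never re-annotate the full corpus: their effort is concentrated on the LM's mistakes, and each piece of feedback becomes an explicit guideline clause that conditions the next round of LM predictions.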