Who Fails Where? LLM and Human Error Patterns in Endometriosis Ultrasound Report Extraction

📅 2026-01-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of converting unstructured transvaginal ultrasound reports of endometriosis into structured data to support radiomics workflows. We evaluated the performance of locally deployed large language models (LLMs) with 7B/8B and 20B parameters on clinical information extraction, leveraging prompt engineering and structured output evaluation metrics, and benchmarked results against annotations by human experts. The 20B-parameter model achieved an average accuracy of 86.02%, significantly outperforming smaller models, and demonstrated superior syntactic fidelity, whereas human experts exhibited stronger semantic comprehension. Our analysis revealed complementary error patterns between humans and LLMs, leading to the proposal of a synergistic workflow—“LLM pre-screening followed by expert semantic validation”—which offers a novel paradigm for human–AI collaboration in medical text structuring.

📝 Abstract
In this study, we evaluate locally deployed large language models (LLMs) for converting unstructured endometriosis transvaginal ultrasound (eTVUS) scan reports into structured data for imaging informatics workflows. Across 49 eTVUS reports, we compared three LLMs (7B/8B and a 20B-parameter model) against expert human extraction. The 20B model achieved a mean accuracy of 86.02%, substantially outperforming the smaller models and confirming the importance of scale in handling complex clinical text. Crucially, we identified a highly complementary error profile: the LLM excelled at syntactic consistency (e.g., date and numeric formatting) where humans faltered, while human experts provided superior semantic and contextual interpretation. We also found that the LLM's semantic errors were fundamental limitations that could not be mitigated by simple prompt engineering. These findings strongly support a human-in-the-loop (HITL) workflow in which the on-premise LLM serves as a collaborative tool, not a full replacement: it automates routine structuring and flags potential human errors, enabling imaging specialists to focus on high-level semantic validation. We discuss implications for structured reporting and interactive AI systems in clinical practice.
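The accuracy comparison described above can be illustrated with a minimal sketch: LLM-extracted fields are matched against expert annotations and scored per field. The field names and values here (`lesion_site`, `sliding_sign`, etc.) are hypothetical placeholders, not the paper's actual extraction schema.

```python
# Hypothetical sketch of field-level accuracy scoring for structured extraction.
# Field names and values are illustrative; they are not the paper's schema.

def field_accuracy(llm_record: dict, expert_record: dict) -> float:
    """Fraction of expert-annotated fields the LLM extracted identically."""
    fields = expert_record.keys()
    if not fields:
        return 0.0
    matches = sum(1 for f in fields if llm_record.get(f) == expert_record[f])
    return matches / len(fields)

expert = {"lesion_site": "rectosigmoid", "lesion_size_mm": 14, "sliding_sign": "negative"}
llm    = {"lesion_site": "rectosigmoid", "lesion_size_mm": 14, "sliding_sign": "positive"}

acc = field_accuracy(llm, expert)  # 2 of 3 fields match
```

A report-set mean of such per-report scores would yield an aggregate accuracy figure of the kind the paper reports; exact-match scoring like this rewards syntactic consistency, which may partly explain why formatting fidelity matters so much in this task.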
Problem

Research questions and friction points this paper is trying to address.

structured extraction
endometriosis
transvaginal ultrasound
clinical text
imaging informatics
Innovation

Methods, ideas, or system contributions that make the work stand out.

large language model
structured extraction
human-in-the-loop
clinical text processing
local deployment
Haiyi Li
University of Adelaide
Yutong Li
University of Adelaide
Yiheng Chi
Purdue University
Computational Imaging, Denoising, Image Processing, Computer Vision
A. Deslandes
Robinson Research Institute, University of Adelaide
Mathew Leonardi
McMaster University, Robinson Research Institute
S. Freger
McMaster University, Robinson Research Institute
Yuan Zhang
Postdoctoral research fellow, University of Adelaide
Deep Learning, Computer Vision, Medical Imaging Analysis, Endometriosis Diagnosis
Jodie Avery
Robinson Research Institute, University of Adelaide
M. Hull
Robinson Research Institute, University of Adelaide
Hsiang-Ting Chen
Australian Institute for Machine Learning, University of Adelaide