Who Fails Where? LLM and Human Error Patterns in Endometriosis Ultrasound Report Extraction

📅 2026-01-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of converting unstructured transvaginal ultrasound reports of endometriosis into structured data to support radiomics workflows. We evaluated the performance of locally deployed large language models (LLMs) with 7B/8B and 20B parameters on clinical information extraction, leveraging prompt engineering and structured output evaluation metrics, and benchmarked results against annotations by human experts. The 20B-parameter model achieved an average accuracy of 86.02%, significantly outperforming smaller models, and demonstrated superior syntactic fidelity, whereas human experts exhibited stronger semantic comprehension. Our analysis revealed complementary error patterns between humans and LLMs, leading to the proposal of a synergistic workflow—“LLM pre-screening followed by expert semantic validation”—which offers a novel paradigm for human–AI collaboration in medical text structuring.

📝 Abstract
In this study, we evaluate locally deployed large language models (LLMs) for converting unstructured endometriosis transvaginal ultrasound (eTVUS) scan reports into structured data for imaging informatics workflows. Across 49 eTVUS reports, we compared three LLMs (7B/8B and a 20B-parameter model) against expert human extraction. The 20B model achieved a mean accuracy of 86.02%, substantially outperforming the smaller models and confirming the importance of scale in handling complex clinical text. Crucially, we identified a highly complementary error profile: the LLM excelled at syntactic consistency (e.g., date and numeric formatting) where humans faltered, while human experts provided superior semantic and contextual interpretation. We also found that the LLM's semantic errors were fundamental limitations that could not be mitigated by simple prompt engineering. These findings strongly support a human-in-the-loop (HITL) workflow in which the on-premise LLM serves as a collaborative tool, not a full replacement: it automates routine structuring and flags potential human errors, enabling imaging specialists to focus on high-level semantic validation. We discuss implications for structured reporting and interactive AI systems in clinical practice.
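The accuracy comparison described above can be illustrated with a minimal sketch: LLM-extracted fields are matched against expert annotations and scored per field. The field names and values here (`lesion_site`, `sliding_sign`, etc.) are hypothetical placeholders, not the paper's actual extraction schema.

```python
# Hypothetical sketch of field-level accuracy scoring for structured extraction.
# Field names and values are illustrative; they are not the paper's schema.

def field_accuracy(llm_record: dict, expert_record: dict) -> float:
    """Fraction of expert-annotated fields the LLM extracted identically."""
    fields = expert_record.keys()
    if not fields:
        return 0.0
    matches = sum(1 for f in fields if llm_record.get(f) == expert_record[f])
    return matches / len(fields)

expert = {"lesion_site": "rectosigmoid", "lesion_size_mm": 14, "sliding_sign": "negative"}
llm    = {"lesion_site": "rectosigmoid", "lesion_size_mm": 14, "sliding_sign": "positive"}

acc = field_accuracy(llm, expert)  # 2 of 3 fields match
```

A report-set mean of such per-report scores would yield an aggregate accuracy figure of the kind the paper reports; exact-match scoring like this rewards syntactic consistency, which may partly explain why formatting fidelity matters so much in this task.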
Problem

Research questions and friction points this paper is trying to address.

structured extraction
endometriosis
transvaginal ultrasound
clinical text
imaging informatics
Innovation

Methods, ideas, or system contributions that make the work stand out.

large language model
structured extraction
human-in-the-loop
clinical text processing
local deployment
Haiyi Li
University of Adelaide
Yutong Li
University of Adelaide
Yiheng Chi
Purdue University
Computational Imaging, Denoising, Image Processing, Computer Vision
A. Deslandes
Robinson Research Institute, University of Adelaide
Mathew Leonardi
McMaster University, Robinson Research Institute
S. Freger
McMaster University, Robinson Research Institute
Yuan Zhang
Postdoctoral research fellow, University of Adelaide
Deep Learning, Computer Vision, Medical Imaging Analysis, Endometriosis Diagnosis
Jodie Avery
Robinson Research Institute, University of Adelaide
M. Hull
Robinson Research Institute, University of Adelaide
Hsiang-Ting Chen
Australian Institute for Machine Learning, University of Adelaide