🤖 AI Summary
Structured annotation of abdominal CT reports is difficult: many organs and diverse abnormality types must be labeled, and clinical urgency must be assessed jointly with presence. Existing methods struggle to balance label granularity with reliability.
Method: We propose LEAVS, a structured label generation system for abdominal CT reports that automatically annotates two dimensions (certainty of presence and urgency) across 9 organs and 7 abnormality categories. Our approach introduces a tree-structured chain-of-thought prompting framework that combines sentence-level extraction with multiple-choice questions, enabling end-to-end inference on a locally deployed large language model.
Contribution/Results: The system achieves an average F1 score of 0.89 for abnormality extraction, and its urgency labels reach performance comparable to human annotations. The generated labels are used to train a single vision model that classifies several organs as normal or abnormal. We publicly release both the source code and structured annotations for a public CT dataset containing over 1,000 CT volumes, addressing the lack of fine-grained, organ-specific supervision for abdominal imaging.
📝 Abstract
Extracting structured labels from radiology reports has been used to train vision models that simultaneously detect several types of abnormalities. However, existing works focus mainly on the chest region. Few works have investigated abdominal radiology reports, owing to the more complex anatomy and wider range of pathologies in the abdomen. We propose LEAVS (Large language model Extractor for Abdominal Vision Supervision). This labeler annotates the certainty of presence and the urgency of seven types of abnormalities for nine abdominal organs in CT radiology reports. To ensure broad coverage, we chose abnormalities that encompass most of the finding types in CT reports. Our approach employs a specialized chain-of-thought prompting strategy for a locally run LLM, using sentence extraction and multiple-choice questions in a tree-based decision system. We demonstrate that the LLM can extract several abnormality types across abdominal organs with an average F1 score of 0.89, significantly outperforming competing labelers and humans. Additionally, we show that extraction of urgency labels achieved performance comparable to human annotations. Finally, we demonstrate that the abnormality labels contain valuable information for training a single vision model that classifies several organs as normal or abnormal. We release our code and structured annotations for a public CT dataset containing over 1,000 CT volumes.
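The tree-based flow described in the abstract (sentence extraction first, then multiple-choice questions per organ and abnormality type) can be illustrated with a minimal sketch. This is not the released LEAVS code: the organ and abnormality lists, the answer options, the prompt wording, and the function names `ask_choice`, `label_report`, and `fake_llm` are all assumptions made for illustration. The model is passed in as a plain callable so that any locally run LLM could be substituted.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical vocabularies: the abstract states nine organs and seven
# abnormality types but does not enumerate them, so two of each stand in.
ORGANS = ["liver", "spleen"]
ABNORMALITY_TYPES = ["focal lesion", "enlargement"]
PRESENCE_OPTIONS = ["present", "uncertain", "absent", "not mentioned"]  # assumed
URGENCY_OPTIONS = ["low", "medium", "high"]  # assumed
LETTERS = "ABCDEFGH"

AskFn = Callable[[str], str]  # prompt text in, raw model reply out
Label = Dict[str, Optional[str]]


def ask_choice(ask: AskFn, question: str, options: List[str]) -> str:
    """Pose a multiple-choice question and map the reply letter to an option."""
    menu = "\n".join(f"{LETTERS[i]}) {opt}" for i, opt in enumerate(options))
    reply = ask(f"{question}\n{menu}\nAnswer with a single letter.").strip().upper()
    index = LETTERS.find(reply[0]) if reply else -1
    if not 0 <= index < len(options):
        raise ValueError(f"Unparseable answer: {reply!r}")
    return options[index]


def label_report(report: str, ask: AskFn) -> Dict[str, Dict[str, Label]]:
    """Walk the decision tree: organ -> extracted sentences -> presence -> urgency."""
    labels: Dict[str, Dict[str, Label]] = {}
    for organ in ORGANS:
        # Root of the tree: extract only the sentences about this organ.
        sentences = ask(
            f"Report:\n{report}\n\n"
            f"Copy every sentence that mentions the {organ}. "
            "Reply NONE if there are none."
        ).strip()
        labels[organ] = {}
        for abnormality in ABNORMALITY_TYPES:
            if not sentences or sentences.upper() == "NONE":
                # Prune the branch: no follow-up questions are needed.
                labels[organ][abnormality] = {"presence": "not mentioned", "urgency": None}
                continue
            presence = ask_choice(
                ask,
                f"Sentences:\n{sentences}\n\nIs '{abnormality}' of the {organ} described?",
                PRESENCE_OPTIONS,
            )
            urgency = None
            if presence in ("present", "uncertain"):
                # Urgency is only asked on branches where a finding may exist.
                urgency = ask_choice(
                    ask,
                    f"Sentences:\n{sentences}\n\n"
                    f"How urgent is the '{abnormality}' finding in the {organ}?",
                    URGENCY_OPTIONS,
                )
            labels[organ][abnormality] = {"presence": presence, "urgency": urgency}
    return labels


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a locally run model, for demonstration only."""
    if "mentions the liver" in prompt:
        return "The liver shows a 2 cm hypodense lesion."
    if "mentions the" in prompt:
        return "NONE"
    if "How urgent" in prompt:
        return "B"
    if "'focal lesion'" in prompt:
        return "A"
    return "C"
```

Calling `label_report(report_text, fake_llm)` returns a nested dictionary keyed by organ and abnormality type. Restricting the model to one small question per node keeps each reply easy to parse, and pruning branches for unmentioned organs avoids unnecessary model calls.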