🤖 AI Summary
This work addresses the challenge of generating fine-grained structured radiology reports, which is hindered by the scarcity of structured supervision for rare findings and their attributes. The authors propose an approach that uses an instruction-tuned large language model to extract implicit image-associated knowledge from large-scale free-text reports, building a multimodal knowledge base aligned with a structured reporting template in which each answer option is represented by a visual prototype. At inference time, prototypes relevant to the current image-question pair are retrieved and injected through a prototype-conditioned residual mechanism that refines predictions in a data-driven way. Evaluated on the Rad-ReStruct benchmark, the method achieves state-of-the-art performance, with the largest improvements on fine-grained attribute generation.
📝 Abstract
Structured radiology reporting promises faster, more consistent communication than free text, but automation remains difficult, as models must make many fine-grained, discrete decisions about rare findings and attributes from limited structured supervision. In contrast, free-text reports are produced at scale in routine care and implicitly encode fine-grained, image-linked information through detailed descriptions. To leverage this unstructured knowledge, we propose ProtoSR, an approach for injecting free-text information into structured report population. First, we introduce an automatic extraction pipeline that uses an instruction-tuned LLM to mine 80k+ MIMIC-CXR studies and build a multimodal knowledge base aligned with a structured reporting template, representing each answer option with a visual prototype. Using this knowledge base, ProtoSR is trained to retrieve prototypes relevant to the current image-question pair and to augment model predictions through a prototype-conditioned residual, providing a data-driven second opinion that selectively corrects predictions. On the Rad-ReStruct benchmark, ProtoSR achieves state-of-the-art results, with the largest improvements on detailed attribute questions, demonstrating the value of integrating free-text-derived signal for fine-grained image understanding.
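The retrieve-then-correct idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, embedding shapes, mixing weight `alpha`, and the softmax-weighted prototype vote are all illustrative assumptions; the abstract specifies only that relevant prototypes are retrieved per image-question pair and combined with the base prediction through a prototype-conditioned residual.

```python
import numpy as np

def retrieve_prototypes(query_emb, prototype_embs, top_k=3):
    """Rank prototype embeddings by cosine similarity to an
    image-question query embedding (hypothetical retrieval step)."""
    q = query_emb / np.linalg.norm(query_emb)
    p = prototype_embs / np.linalg.norm(prototype_embs, axis=1, keepdims=True)
    sims = p @ q                       # cosine similarity per prototype
    top = np.argsort(-sims)[:top_k]    # indices of the top-k prototypes
    return top, sims[top]

def residual_correction(base_logits, prototype_logits, sims, alpha=0.5):
    """Prototype-conditioned residual (assumed form): a similarity-
    weighted vote over retrieved prototypes is added to the base
    model's logits as a data-driven 'second opinion'."""
    w = np.exp(sims) / np.exp(sims).sum()          # softmax over similarities
    proto_vote = (w[:, None] * prototype_logits).sum(axis=0)
    return base_logits + alpha * proto_vote        # selective correction
```

The additive form means the base model's prediction survives unchanged when the prototype vote is uninformative, matching the "selectively corrects predictions" framing in the abstract.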