🤖 AI Summary
Current vision-language models for CT report generation rely on global embeddings, which are prone to hallucination and lack fine-grained modeling of three-dimensional anatomical structures. This work proposes a hypothesis-driven, iterative evidence acquisition framework in which a large language model dynamically invokes lesion-specific 3D feature extraction tools to align local voxel-level evidence with textual descriptions within a multidimensional retrieval space, thereby adopting an “accumulate quantitative evidence before generating reports” paradigm. By introducing an agent-based workflow into medical imaging report generation for the first time, the method achieves state-of-the-art performance on the CT-RATE and RadChestCT datasets without requiring model fine-tuning, significantly outperforming existing 2D and 3D approaches in clinical accuracy, factual consistency, and interpretability.
📝 Abstract
Vision-language models (VLMs) have shown potential for automated radiology report generation, yet existing approaches rely on global embedding compression of volumetric data, often leading to hallucinated findings and limited anatomical grounding in 3D CT imaging. We introduce MedScribe, a hypothesis-driven framework that reformulates report generation as an iterative evidence acquisition process rather than a single-pass encoding task. MedScribe models reporting as a sequential decision process in which a large language model dynamically invokes pathology-specific diagnostic tools to extract localized volumetric features. These structured features are used to query a multidimensional retrieval space aligned with pathology-specific textual evidence. By explicitly accumulating quantitative evidence prior to synthesis, the framework enforces fine-grained grounding and reduces unsupported claims. Without task-specific fine-tuning, MedScribe improves clinical accuracy, factual consistency, and interpretability on CT-RATE and RadChestCT compared to state-of-the-art 2D and 3D VLMs, demonstrating the value of hypothesis-driven reasoning for reliable medical image reporting.
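The evidence-first loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the tool functions, the feature definitions, the retrieval entries, and the rule-based `choose_next_hypothesis` (standing in for the large language model) are all hypothetical, and the "CT volume" is a nested list of floats. It only shows the control flow: pick a hypothesis, run a pathology-specific tool, retrieve the closest textual evidence, and synthesize the report after the evidence has been collected.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Volume = List[List[List[float]]]  # toy stand-in for a CT volume indexed (z, y, x)
Features = Tuple[float, ...]


@dataclass
class Evidence:
    pathology: str
    features: Features
    retrieved_text: str


def _flatten(volume: Volume) -> List[float]:
    return [v for plane in volume for row in plane for v in row]


def nodule_tool(volume: Volume) -> Features:
    # Hypothetical features: fraction of bright voxels and peak intensity.
    voxels = _flatten(volume)
    bright = sum(1 for v in voxels if v > 0.5) / len(voxels)
    return (bright, max(voxels))


def effusion_tool(volume: Volume) -> Features:
    # Hypothetical features: mean intensity and minimum intensity.
    voxels = _flatten(volume)
    return (sum(voxels) / len(voxels), min(voxels))


# Assumed tool registry: one feature extractor per pathology hypothesis.
TOOLS: Dict[str, Callable[[Volume], Features]] = {
    "nodule": nodule_tool,
    "effusion": effusion_tool,
}

# Assumed retrieval space: per pathology, (reference features, report sentence).
RETRIEVAL: Dict[str, List[Tuple[Features, str]]] = {
    "nodule": [
        ((0.0, 0.0), "No pulmonary nodule."),
        ((0.25, 1.0), "Pulmonary nodule present."),
    ],
    "effusion": [
        ((0.0, 0.0), "No pleural effusion."),
        ((0.8, 0.5), "Pleural effusion present."),
    ],
}


def retrieve(pathology: str, features: Features) -> str:
    # Nearest neighbour by squared Euclidean distance in the feature space.
    def dist(entry: Tuple[Features, str]) -> float:
        return sum((a - b) ** 2 for a, b in zip(entry[0], features))

    return min(RETRIEVAL[pathology], key=dist)[1]


def choose_next_hypothesis(evidence: List[Evidence]) -> Optional[str]:
    # Stand-in for the LLM policy: test each pathology once, in registry order.
    tested = {e.pathology for e in evidence}
    for name in TOOLS:
        if name not in tested:
            return name
    return None  # nothing left to test, so the loop can stop


def generate_report(volume: Volume, max_steps: int = 8) -> str:
    evidence: List[Evidence] = []
    for _ in range(max_steps):
        hypothesis = choose_next_hypothesis(evidence)
        if hypothesis is None:
            break
        features = TOOLS[hypothesis](volume)
        evidence.append(Evidence(hypothesis, features, retrieve(hypothesis, features)))
    # Synthesis happens only after the evidence has been accumulated.
    return " ".join(e.retrieved_text for e in evidence)
```

In this sketch every sentence of the report traces back to an `Evidence` record holding the features that triggered it, which is the grounding property the abstract emphasizes. A real system would replace the rule-based policy with model-driven tool selection and the nearest-neighbour lookup with a learned retrieval space.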