🤖 AI Summary
High-resolution computed tomography (HRCT) images exhibit diverse pathological patterns and spatial sparsity, posing significant challenges for the automatic generation of accurate diagnostic reports. To address these challenges, this work proposes AbSteering, a novel framework that, for the first time, effectively adapts general-purpose video-language models to 3D medical image interpretation. The approach introduces an abnormality-centric chain-of-thought mechanism to guide report generation and incorporates a direct preference optimization objective, built on clinically confusable abnormalities, to enhance fine-grained discriminative capability. Experimental results demonstrate that AbSteering substantially outperforms existing specialized CT foundation models in both detection sensitivity and hallucination suppression, validating the strong transferability and practical utility of general-purpose video-language models for medical image understanding.
📝 Abstract
Generating precise diagnostic reports from High-Resolution Computed Tomography (HRCT) is critical to the clinical workflow, yet it remains a formidable challenge due to the high pathological diversity and spatial sparsity within 3D volumes. While Video Language Models (VideoLMs) have demonstrated remarkable spatio-temporal reasoning in general domains, their adaptability to domain-specific, volumetric medical interpretation remains underexplored. In this work, we present AbSteering, an abnormality-centric framework that steers VideoLMs toward precise HRCT report generation. Specifically, AbSteering introduces: (i) an abnormality-centric Chain-of-Thought scheme that enforces abnormality-focused reasoning, and (ii) a Direct Preference Optimization objective that uses clinically confusable abnormalities as hard negatives to enhance fine-grained discrimination. Our results demonstrate that general-purpose VideoLMs possess strong transferability to volumetric medical imaging when guided by this paradigm. Notably, AbSteering outperforms state-of-the-art domain-specific CT foundation models pretrained on large-scale CT data, achieving superior detection sensitivity while simultaneously mitigating hallucinations. Our data and model weights are released at https://anonymous.4open.science/r/hrct-report-generation-video-vlm-728C/
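As a point of reference, the Direct Preference Optimization objective mentioned in (ii) typically takes the standard DPO form below; the interpretation of the preferred report $y_w$ and the hard-negative report $y_l$ (one mentioning a clinically confusable abnormality) is our reading of the abstract, not a formula taken from the paper itself:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
\;-\;
\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)
\right]
```

Here $x$ is the HRCT volume (plus prompt), $\pi_\theta$ the model being tuned, $\pi_{\mathrm{ref}}$ a frozen reference policy, $\sigma$ the sigmoid, and $\beta$ a temperature controlling deviation from the reference; using confusable abnormalities as $y_l$ makes the preference margin target exactly the fine-grained distinctions the framework aims to sharpen.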