🤖 AI Summary
This work proposes a vision-instruction-tuned framework for 3D CT image–language understanding. It targets two limitations of conventional survival prediction: reliance on expert interpretation of CT scans, and the loss of visual information when images are reduced to textual summaries. The model is pre-trained on large-scale open-source CT images paired with radiology reports to learn clinically meaningful visual-textual representations, then adapted to downstream survival prediction by adding a survival prediction head on top of the pre-trained model. The resulting model predicts survival from CT images and clinical data while generating clinically meaningful language responses to predefined questions. It outperforms baseline methods in survival prediction, with the largest gains when clinical data alone is less predictive.
📝 Abstract
Accurate prognostication and risk estimation are essential for guiding clinical decision-making and optimizing patient management. While radiologist-assessed features from CT scans provide valuable indicators of disease severity and outcomes, interpreting such images requires expert knowledge, and translating rich visual information into textual summaries inevitably leads to information loss. In this work, we propose a vision-language framework for 3D CT image understanding that leverages large-scale open-source CT images paired with radiology reports through visual instruction tuning. This pre-training enables the model to learn clinically meaningful visual-textual representations, which can then be adapted to downstream survival prediction tasks. By incorporating a survival prediction head on top of the pre-trained model, our approach improves survival prediction from CT images and clinical data while generating clinically meaningful language responses to predefined questions. Experimental results demonstrate that our method outperforms baseline methods in survival prediction, particularly when clinical data alone is less predictive. The code will be released upon acceptance.
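The abstract does not specify the form of the survival prediction head or its training objective. As an illustration only, the sketch below assumes a linear head over concatenated image and clinical features, trained with the negative Cox partial log-likelihood, which is one common choice for survival models. The names `survival_head` and `cox_neg_partial_log_likelihood` are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def survival_head(image_embedding, clinical, weights, bias=0.0):
    """Hypothetical linear head: one risk score per patient.

    Assumes the pre-trained model yields a fixed-size embedding per CT scan,
    which is concatenated with the clinical covariates.
    """
    features = np.concatenate([image_embedding, clinical], axis=1)
    return features @ weights + bias


def cox_neg_partial_log_likelihood(risk, time, event):
    """Negative Cox partial log-likelihood, averaged over observed events.

    Assumption: tied event times are not given special treatment.
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    # Sort by descending time so the risk set of row i is rows 0..i.
    order = np.argsort(-time, kind="stable")
    risk, event = risk[order], event[order]
    # Running log-sum-exp of risk scores over each risk set.
    log_risk_set = np.logaddexp.accumulate(risk)
    # Only patients with an observed event contribute a term.
    n_events = max(int(event.sum()), 1)
    return -np.sum((risk - log_risk_set)[event]) / n_events


# Toy usage: three patients, all with observed events.
time = np.array([3.0, 2.0, 1.0])
event = np.array([1, 1, 1])
loss_good = cox_neg_partial_log_likelihood([-1.0, 0.0, 1.0], time, event)
loss_bad = cox_neg_partial_log_likelihood([1.0, 0.0, -1.0], time, event)
```

In this toy case `loss_good` is lower than `loss_bad`, because the first set of risk scores assigns higher risk to patients who experience the event earlier.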