🤖 AI Summary
Medical anatomical understanding in imaging is hindered by the scarcity of expert annotations, and existing vision-language models (VLMs) lack proficiency in cross-image referential localization and fine-grained segmentation. To address this, we propose Referential Anatomical Understanding (RAU), the first framework to leverage VLMs for reference-image-guided anatomical identification, relative spatial reasoning, and localization in medical images—coupled with SAM2 for pixel-level segmentation. RAU requires no annotations on target images; instead, it exploits the VLM to model anatomical spatial relationships between reference and target images, thereby guiding SAM2 for accurate segmentation. Evaluated on two in-distribution and two out-of-distribution datasets, RAU significantly outperforms fine-tuned SAM2 baselines under equivalent memory budgets, achieving superior segmentation accuracy and localization robustness. These results demonstrate RAU’s strong generalization capability and clinical scalability.
📝 Abstract
Anatomical understanding through deep learning is critical for automatic report generation, intra-operative navigation, and organ localization in medical imaging; however, its progress is constrained by the scarcity of expert-labeled data. A promising remedy is to leverage an annotated reference image to guide the interpretation of an unlabeled target. Although recent vision-language models (VLMs) exhibit non-trivial visual reasoning, their reference-based understanding and fine-grained localization remain limited. We introduce RAU, a framework for reference-based anatomical understanding with VLMs. We first show that a VLM, trained on a moderately sized dataset, learns to identify anatomical regions through relative spatial reasoning between reference and target images. We validate this capability through visual question answering (VQA) and bounding box prediction. Next, we demonstrate that the VLM-derived spatial cues can be seamlessly integrated with the fine-grained segmentation capability of SAM2, enabling localization and pixel-level segmentation of small anatomical regions, such as vessel segments. Across two in-distribution and two out-of-distribution datasets, RAU consistently outperforms a SAM2 fine-tuning baseline under the same memory setup, yielding more accurate segmentations and more reliable localization. More importantly, its strong generalization makes it scalable to out-of-distribution datasets, a property crucial for medical imaging applications. To the best of our knowledge, RAU is the first to explore the capability of VLMs for reference-based identification, localization, and segmentation of anatomical structures in medical images. Its promising performance highlights the potential of VLM-driven approaches for anatomical understanding in automated clinical workflows.
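The pipeline described above, in which a VLM produces a spatial cue (a bounding box on the target image) that then prompts SAM2, implies a small glue step between the two models. The sketch below illustrates one plausible form of that step; the function name, the assumption that the VLM emits normalized corner coordinates as a JSON-style list, and the clamping behavior are all illustrative assumptions, not RAU's actual interface:

```python
# Hypothetical glue between a VLM's box prediction and a SAM2-style box prompt.
# Assumes (not specified by the paper) that the VLM answers with normalized
# corner coordinates in [0, 1], e.g. "[0.25, 0.5, 0.75, 1.0]".
import json

def vlm_box_to_prompt(vlm_answer: str, width: int, height: int) -> list[int]:
    """Parse a normalized [x0, y0, x1, y1] box and scale it to pixel coordinates."""
    x0, y0, x1, y1 = json.loads(vlm_answer)
    # Clamp to the valid range and re-order corners before scaling,
    # since raw VLM outputs can drift out of bounds or come back swapped.
    x0, x1 = sorted(min(max(v, 0.0), 1.0) for v in (x0, x1))
    y0, y1 = sorted(min(max(v, 0.0), 1.0) for v in (y0, y1))
    return [round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height)]

# The pixel box could then serve as a prompt for a SAM2 image predictor,
# along the lines of (not executed here):
#   masks, scores, _ = predictor.predict(box=np.array(box), multimask_output=False)
```

The clamping and corner re-ordering matter for small structures such as vessel segments, where an out-of-bounds or inverted box would otherwise produce a degenerate prompt.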