RAU: Reference-based Anatomical Understanding with Vision Language Models

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical anatomical understanding in imaging is hindered by the scarcity of expert annotations, and existing vision-language models (VLMs) lack proficiency in cross-image referential localization and fine-grained segmentation. To address this, we propose Referential Anatomical Understanding (RAU), the first framework to leverage VLMs for reference-image-guided anatomical identification, relative spatial reasoning, and localization in medical images—coupled with SAM2 for pixel-level segmentation. RAU requires no annotations on target images; instead, it exploits the VLM to model anatomical spatial relationships between reference and target images, thereby guiding SAM2 for accurate segmentation. Evaluated on two in-distribution and two out-of-distribution datasets, RAU significantly outperforms fine-tuned SAM2 baselines under equivalent memory budgets, achieving superior segmentation accuracy and localization robustness. These results demonstrate RAU’s strong generalization capability and clinical scalability.

📝 Abstract
Anatomical understanding through deep learning is critical for automatic report generation, intra-operative navigation, and organ localization in medical imaging; however, its progress is constrained by the scarcity of expert-labeled data. A promising remedy is to leverage an annotated reference image to guide the interpretation of an unlabeled target. Although recent vision-language models (VLMs) exhibit non-trivial visual reasoning, their reference-based understanding and fine-grained localization remain limited. We introduce RAU, a framework for reference-based anatomical understanding with VLMs. We first show that a VLM learns to identify anatomical regions through relative spatial reasoning between reference and target images, trained on a moderately sized dataset. We validate this capability through visual question answering (VQA) and bounding box prediction. Next, we demonstrate that the VLM-derived spatial cues can be seamlessly integrated with the fine-grained segmentation capability of SAM2, enabling localization and pixel-level segmentation of small anatomical regions, such as vessel segments. Across two in-distribution and two out-of-distribution datasets, RAU consistently outperforms a SAM2 fine-tuning baseline using the same memory setup, yielding more accurate segmentations and more reliable localization. More importantly, its strong generalization ability makes it scalable to out-of-distribution datasets, a property crucial for medical image applications. To the best of our knowledge, RAU is the first to explore the capability of VLMs for reference-based identification, localization, and segmentation of anatomical structures in medical images. Its promising performance highlights the potential of VLM-driven approaches for anatomical understanding in automated clinical workflows.
Problem

Research questions and friction points this paper is trying to address.

Addresses the scarcity of expert-labeled data for anatomical understanding in medical imaging
Enables reference-based anatomical identification with vision-language models
Enables fine-grained localization and pixel-level segmentation of anatomical structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses reference images for anatomical region identification
Integrates VLM spatial cues with SAM2 segmentation
Enables pixel-level segmentation of small anatomical structures
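The pipeline the bullets above describe, a VLM inferring where a reference annotation lands in the target image and a SAM2-style segmenter refining that box prompt into a mask, can be sketched schematically. The functions `vlm_predict_box` and `segment_from_box` below are hypothetical stand-ins (the paper's actual components are a fine-tuned VLM and SAM2); this is a minimal illustration of the data flow under those assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x0: int; y0: int; x1: int; y1: int  # pixel corners, half-open ranges

def vlm_predict_box(reference_box: Box, dx: int, dy: int) -> Box:
    """Hypothetical stand-in for the VLM's relative spatial reasoning:
    here it simply translates the reference annotation by an offset that a
    real model would infer from the reference/target image pair."""
    return Box(reference_box.x0 + dx, reference_box.y0 + dy,
               reference_box.x1 + dx, reference_box.y1 + dy)

def segment_from_box(target, box: Box, threshold=0.5):
    """Toy stand-in for SAM2's box-prompted segmentation: inside the
    predicted box, mark pixels above an intensity threshold as foreground."""
    h, w = len(target), len(target[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(max(0, box.y0), min(h, box.y1)):
        for x in range(max(0, box.x0), min(w, box.x1)):
            if target[y][x] > threshold:
                mask[y][x] = 1
    return mask

# Toy target image: a bright 2x2 "vessel segment" at rows 3-4, cols 4-5.
target = [[0.0] * 8 for _ in range(8)]
for y in (3, 4):
    for x in (4, 5):
        target[y][x] = 0.9

ref_box = Box(2, 1, 4, 3)                        # annotation on the reference image
pred_box = vlm_predict_box(ref_box, dx=2, dy=2)  # VLM maps it into the target
mask = segment_from_box(target, pred_box)
print(sum(sum(row) for row in mask))             # → 4 segmented pixels
```

The key design point is the loose coupling: the VLM only needs to emit a coarse spatial cue (a box), and the segmenter turns it into a pixel-level mask, so no annotations on the target image are required.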
👥 Authors
Yiwei Li (United Imaging Intelligence, Boston, MA)
Yikang Liu (Shanghai Jiao Tong University)
Jiaqi Guo (United Imaging Intelligence, Boston, MA)
Lin Zhao (United Imaging Intelligence, Boston, MA)
Zheyuan Zhang (United Imaging Intelligence, Boston, MA)
Xiao Chen (United Imaging Intelligence, Boston, MA)
Boris Mailhe (United Imaging Intelligence, Boston, MA)
Ankush Mukherjee (United Imaging Intelligence, Boston, MA)
Terrence Chen (UII America, Inc.)
Shanhui Sun (UII America, Inc.)