🤖 AI Summary
Medical anatomical understanding in imaging is hindered by the scarcity of expert annotations, and existing vision-language models (VLMs) lack proficiency in cross-image referential localization and fine-grained segmentation. To address this, we propose Referential Anatomical Understanding (RAU), the first framework to leverage VLMs for reference-image-guided anatomical identification, relative spatial reasoning, and localization in medical images—coupled with SAM2 for pixel-level segmentation. RAU requires no annotations on target images; instead, it exploits the VLM to model anatomical spatial relationships between reference and target images, thereby guiding SAM2 for accurate segmentation. Evaluated on two in-distribution and two out-of-distribution datasets, RAU significantly outperforms fine-tuned SAM2 baselines under equivalent memory budgets, achieving superior segmentation accuracy and localization robustness. These results demonstrate RAU’s strong generalization capability and clinical scalability.
📝 Abstract
Anatomical understanding through deep learning is critical for automatic report generation, intra-operative navigation, and organ localization in medical imaging; however, its progress is constrained by the scarcity of expert-labeled data. A promising remedy is to leverage an annotated reference image to guide the interpretation of an unlabeled target. Although recent vision-language models (VLMs) exhibit non-trivial visual reasoning, their reference-based understanding and fine-grained localization remain limited. We introduce RAU, a framework for reference-based anatomical understanding with VLMs. We first show that a VLM, trained on a moderately sized dataset, learns to identify anatomical regions through relative spatial reasoning between reference and target images. We validate this capability through visual question answering (VQA) and bounding box prediction. Next, we demonstrate that the VLM-derived spatial cues can be seamlessly integrated with the fine-grained segmentation capability of SAM2, enabling localization and pixel-level segmentation of small anatomical regions, such as vessel segments. Across two in-distribution and two out-of-distribution datasets, RAU consistently outperforms a SAM2 fine-tuning baseline under the same memory setup, yielding more accurate segmentations and more reliable localization. More importantly, its strong generalization makes it scalable to out-of-distribution datasets, a property crucial for medical imaging applications. To the best of our knowledge, RAU is the first to explore the capability of VLMs for reference-based identification, localization, and segmentation of anatomical structures in medical images. Its promising performance highlights the potential of VLM-driven approaches for anatomical understanding in automated clinical workflows.
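The pipeline described above, in which a VLM produces a spatial cue (a bounding box on the target image) that then prompts SAM2, implies a small glue step between the two models. The sketch below illustrates one plausible form of that step; the function name, the assumption that the VLM emits normalized corner coordinates as a JSON-style list, and the clamping behavior are all illustrative assumptions, not RAU's actual interface:

```python
# Hypothetical glue between a VLM's box prediction and a SAM2-style box prompt.
# Assumes (not specified by the paper) that the VLM answers with normalized
# corner coordinates in [0, 1], e.g. "[0.25, 0.5, 0.75, 1.0]".
import json

def vlm_box_to_prompt(vlm_answer: str, width: int, height: int) -> list[int]:
    """Parse a normalized [x0, y0, x1, y1] box and scale it to pixel coordinates."""
    x0, y0, x1, y1 = json.loads(vlm_answer)
    # Clamp to the valid range and re-order corners before scaling,
    # since raw VLM outputs can drift out of bounds or come back swapped.
    x0, x1 = sorted(min(max(v, 0.0), 1.0) for v in (x0, x1))
    y0, y1 = sorted(min(max(v, 0.0), 1.0) for v in (y0, y1))
    return [round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height)]

# The pixel box could then serve as a prompt for a SAM2 image predictor,
# along the lines of (not executed here):
#   masks, scores, _ = predictor.predict(box=np.array(box), multimask_output=False)
```

The clamping and corner re-ordering matter for small structures such as vessel segments, where an out-of-bounds or inverted box would otherwise produce a degenerate prompt.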