🤖 AI Summary
Current vision-language models (VLMs) treat medical images as holistic inputs, neglecting the fine-grained anatomical and pathological details essential for clinical diagnosis, and thus struggle with the interpretation challenges posed by imaging heterogeneity. To address this, we propose an anatomy-aware fine-grained modeling paradigm that explicitly establishes correspondences between local image regions and clinical semantics. Our approach integrates an anatomy localization encoder, a multi-scale feature alignment mechanism, and a structured medical knowledge enhancement module, enabling zero-shot anatomy-level reasoning. Extensive experiments demonstrate that our method significantly outperforms baselines on both in-distribution and out-of-distribution medical imaging benchmarks across disease classification and radiological sign localization tasks, exhibiting strong generalization. Furthermore, transfer to downstream segmentation tasks validates the model's capacity to encode and transfer anatomy- and pathology-aware representations.
📝 Abstract
Accurate disease interpretation from radiology remains challenging due to imaging heterogeneity. Achieving expert-level diagnostic decisions requires integrating subtle image features with clinical knowledge. Yet most vision-language models (VLMs) treat images as holistic entities and overlook the fine-grained image details that are vital for disease diagnosis. Clinicians analyze images by drawing on prior medical knowledge and identifying anatomical structures as key regions of interest (ROIs). Inspired by this human-centric workflow, we introduce Anatomy-VLM, a fine-grained vision-language model that incorporates multi-scale information. First, we design a model encoder to localize key anatomical features within entire medical images. Second, these regions are enriched with structured knowledge for contextually aware interpretation. Finally, the model encoder aligns multi-scale medical information to generate clinically interpretable disease predictions. Anatomy-VLM achieves outstanding performance on both in- and out-of-distribution datasets. We also validate Anatomy-VLM on downstream image segmentation tasks, suggesting that its fine-grained alignment captures anatomical and pathology-related knowledge. Furthermore, Anatomy-VLM's encoder enables zero-shot anatomy-wise interpretation, demonstrating strong expert-level clinical interpretation capabilities.
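The three-stage workflow in the abstract (localize anatomical ROIs, enrich them with structured knowledge, align multi-scale features) can be sketched in toy form as below. This is a minimal illustrative sketch, not the authors' implementation: all function names, random-projection "encoders", and the interpolation-based knowledge fusion are our assumptions, standing in for the paper's learned modules.

```python
import numpy as np

rng = np.random.default_rng(0)

def localize_anatomy(image, num_rois=3, roi_size=16):
    """Stage 1 (assumed): crop candidate anatomical ROIs from the image.
    A real model would use a learned anatomy localization encoder."""
    h, w = image.shape
    rois = []
    for _ in range(num_rois):
        y = rng.integers(0, h - roi_size)
        x = rng.integers(0, w - roi_size)
        rois.append(image[y:y + roi_size, x:x + roi_size])
    return rois

def encode(patch, dim=32):
    """Toy feature encoder: random projection of flattened pixels,
    L2-normalized. Stands in for any vision or text encoder."""
    flat = patch.reshape(-1)
    proj = rng.standard_normal((dim, flat.size)) / np.sqrt(flat.size)
    v = proj @ flat
    return v / (np.linalg.norm(v) + 1e-8)

def enrich_with_knowledge(roi_feat, knowledge_feat, alpha=0.5):
    """Stage 2 (assumed): fuse an ROI feature with a structured-knowledge
    embedding by simple interpolation, then re-normalize."""
    fused = alpha * roi_feat + (1 - alpha) * knowledge_feat
    return fused / (np.linalg.norm(fused) + 1e-8)

def align_multiscale(global_feat, roi_feats):
    """Stage 3 (assumed): pool the whole-image feature with the mean of
    the knowledge-enriched ROI features into one multi-scale embedding."""
    pooled = global_feat + np.mean(roi_feats, axis=0)
    return pooled / (np.linalg.norm(pooled) + 1e-8)

# Toy end-to-end pass on a synthetic 64x64 "image".
image = rng.standard_normal((64, 64))
global_feat = encode(image[::4, ::4])               # coarse whole-image scale
rois = localize_anatomy(image)                      # fine anatomical scale
knowledge = encode(rng.standard_normal((16, 16)))   # stand-in knowledge embedding
roi_feats = [enrich_with_knowledge(encode(r), knowledge) for r in rois]
fused = align_multiscale(global_feat, roi_feats)
print(fused.shape)
```

In the sketch, disease prediction would then amount to comparing `fused` against class text embeddings (e.g. by cosine similarity), which is how zero-shot classification is typically done in CLIP-style VLMs.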