🤖 AI Summary
Current 3D vision-language models lack anatomically grounded, clinically aligned stepwise reasoning, which limits trustworthy collaboration in real-world diagnostic settings. To address this, we introduce 3DReasonKnee, the first grounded reasoning dataset for 3D medical images. It comprises 7,970 knee MRI volumes annotated by expert clinicians with 3D bounding boxes, diagnostic questions, multi-step clinical reasoning chains, and structured severity assessments, yielding 494k high-quality quintuples. Building on this dataset, we establish ReasonKnee-Bench, a benchmark that evaluates localization and diagnostic accuracy across anatomical regions and diagnostic questions. We benchmark five state-of-the-art VLMs to provide baseline performance. Together, the dataset and benchmark offer foundational data and an evaluation framework for interpretable, trustworthy orthopedic AI diagnosis.
📝 Abstract
Current Vision-Language Models (VLMs) struggle to ground anatomical regions in 3D medical images and reason about them in a step-by-step manner, a key requirement of real-world diagnostic assessment. This ability is essential for aligning model outputs with the diagnostic workflows clinicians use in practice, enabling trustworthy clinician-AI collaboration. Existing 3D datasets provide localization labels, but none support this "grounded reasoning" ability. To address this gap, we introduce 3DReasonKnee, the first 3D grounded reasoning dataset for medical images, which provides 494k high-quality quintuples derived from 7,970 3D knee MRI volumes. Each quintuple includes: (1) the 3D MRI volume, (2) a diagnostic question targeting a specific anatomical region, (3) a 3D bounding box localizing the relevant anatomical structures, (4) clinician-generated diagnostic reasoning steps that explicitly detail the 3D reasoning process, and (5) structured severity assessments for the relevant anatomical region. The creation and validation of 3DReasonKnee involved over 450 hours of expert clinician time for manually segmenting MRIs and generating reasoning chains, which underpins its quality and clinical relevance. We establish ReasonKnee-Bench to evaluate localization and diagnostic accuracy, providing insight into VLMs' ability to perform grounding and severity assessment across anatomical regions and diagnostic inquiries. We benchmark five state-of-the-art VLMs, providing baseline performance for ReasonKnee-Bench. By providing this unique resource of expert-annotated 3D reasoning pathways, 3DReasonKnee serves as a repository of orthopedic surgeons' diagnostic expertise and offers a vital testbed for advancing multimodal medical AI systems towards 3D, clinically aligned, localized decision-making capabilities. The dataset is available at: https://huggingface.co/datasets/rajpurkarlab/3DReasonKnee
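To make the quintuple structure concrete, the following is a minimal Python sketch of how one sample might be represented and how a predicted 3D bounding box could be scored against the annotated one. It is an illustration under stated assumptions, not the dataset's actual schema or the benchmark's official metric: the class name `GroundedReasoningSample`, all field names, the box layout `(x_min, y_min, z_min, x_max, y_max, z_max)`, and the use of 3D intersection-over-union for localization are hypothetical choices made here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed box layout (hypothetical): voxel coordinates as
# (x_min, y_min, z_min, x_max, y_max, z_max).
Box3D = Tuple[float, float, float, float, float, float]


@dataclass
class GroundedReasoningSample:
    """Hypothetical container mirroring the five quintuple elements."""
    volume_path: str            # (1) the 3D MRI volume
    question: str               # (2) diagnostic question for one region
    bbox: Box3D                 # (3) 3D bounding box for that region
    reasoning_steps: List[str]  # (4) clinician-written reasoning chain
    severity: Dict[str, int]    # (5) structured severity assessment


def box_volume(box: Box3D) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    x0, y0, z0, x1, y1, z1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0) * max(0.0, z1 - z0)


def iou_3d(pred: Box3D, target: Box3D) -> float:
    """3D intersection-over-union between two axis-aligned boxes."""
    intersection: Box3D = (
        max(pred[0], target[0]),
        max(pred[1], target[1]),
        max(pred[2], target[2]),
        min(pred[3], target[3]),
        min(pred[4], target[4]),
        min(pred[5], target[5]),
    )
    inter_vol = box_volume(intersection)
    union_vol = box_volume(pred) + box_volume(target) - inter_vol
    return inter_vol / union_vol if union_vol > 0 else 0.0


# Placeholder sample; every value is invented for illustration only.
sample = GroundedReasoningSample(
    volume_path="example_volume.nii.gz",
    question="Placeholder question about one anatomical region",
    bbox=(0, 0, 0, 2, 2, 2),
    reasoning_steps=["placeholder step 1", "placeholder step 2"],
    severity={"placeholder_grade": 1},
)

predicted_box: Box3D = (1, 1, 1, 3, 3, 3)
localization_score = iou_3d(predicted_box, sample.bbox)
```

Keeping the box, reasoning steps, and severity labels in one record reflects the abstract's point that localization and diagnosis are evaluated together for the same anatomical region and question.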