Anatomy-R1: Enhancing Anatomy Reasoning in Multimodal Large Language Models via Anatomical Similarity Curriculum and Group Diversity Augmentation

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical multimodal large language models (MLLMs) exhibit weak anatomical reasoning and poor clinical answer consistency on surgical anatomy images, which the authors attribute to data complexity, annotation scarcity, and limitations of existing GRPO training, including insufficient knowledge sharing across anatomical structures and premature convergence to a single reasoning path. To address these issues, the paper proposes two innovations within the GRPO framework: (1) anatomy-aware curriculum learning, which schedules question difficulty by the semantic similarity among answer options; and (2) group-diverse question augmentation, which enriches reasoning paths for challenging queries via multi-perspective rewriting and diversity-driven sampling. Evaluated on the SGG-VQA and OmniMedVQA benchmarks, the approach achieves significant performance gains, demonstrating improved anatomical reasoning fidelity and strong generalization across diverse medical multimodal reasoning tasks.

📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved impressive progress in natural image reasoning, yet their potential in medical imaging remains underexplored, especially for clinical anatomical surgical images. Anatomy understanding tasks demand precise interpretation and clinically coherent answers, which are difficult to achieve due to the complexity of medical data and the scarcity of high-quality expert annotations. These challenges limit the effectiveness of conventional Supervised Fine-Tuning (SFT) strategies. While recent work has demonstrated that Group Relative Policy Optimization (GRPO) can enhance reasoning in MLLMs without relying on large amounts of data, we find two weaknesses that hinder GRPO's reasoning performance in anatomy recognition: 1) knowledge cannot be effectively shared between different anatomical structures, resulting in uneven information gain and preventing the model from converging, and 2) the model quickly converges to a single reasoning path, suppressing the exploration of diverse strategies. To overcome these challenges, we propose two novel methods. First, we implement a progressive learning strategy called Anatomical Similarity Curriculum Learning that controls question difficulty via the similarity of answer choices, enabling the model to master complex problems incrementally. Second, we utilize question augmentation, referred to as Group Diversity Question Augmentation, to expand the model's search space for difficult queries, mitigating the tendency to produce uniform responses. Comprehensive experiments on the SGG-VQA and OmniMedVQA benchmarks show our method achieves significant improvements on both benchmarks, demonstrating its effectiveness in enhancing the medical reasoning capabilities of MLLMs. The code is available at https://github.com/tomato996/Anatomy-R1.
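The curriculum idea in the abstract — grading question difficulty by how similar the answer choices are — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding vectors, the `question_difficulty` scoring function, and the dictionary layout are all hypothetical stand-ins; the paper does not specify how option similarity is computed.

```python
import itertools
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def question_difficulty(option_embeddings):
    """Mean pairwise similarity among a question's answer options:
    near-synonymous distractors score high, i.e. the question is harder."""
    pairs = list(itertools.combinations(option_embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def curriculum_order(questions):
    """Sort questions easy -> hard, so training sees dissimilar-option
    (easy) items before similar-option (hard) ones."""
    return sorted(questions, key=lambda q: question_difficulty(q["options"]))

# Toy example with 2-D stand-in embeddings.
easy = {"id": "easy", "options": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]}
hard = {"id": "hard", "options": [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]}
schedule = curriculum_order([hard, easy])
print([q["id"] for q in schedule])  # → ['easy', 'hard']
```

Any difficulty proxy that is monotone in option similarity would fit this scheduling pattern; the sort could also be replaced by staged buckets that are unlocked over training epochs.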
Problem

Research questions and friction points this paper is trying to address.

Enhance anatomy reasoning in multimodal language models.
Overcome uneven learning and limited strategy exploration.
Improve performance on medical image question-answering benchmarks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive learning via anatomical similarity curriculum for incremental mastery.
Group diversity question augmentation to expand reasoning search space.
Enhances MLLM medical reasoning without extensive expert annotations.
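The diversity-driven sampling step of the augmentation idea — picking a mutually dissimilar subset from a pool of multi-perspective rewrites — can be sketched with a greedy max-min selection. This is an assumed strategy for illustration; the `diverse_subset` function, the embedding field `emb`, and the toy pool are not from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def diverse_subset(rewrites, k):
    """Greedy max-min selection: keep the first rewrite, then repeatedly
    add the candidate least similar to anything already chosen, so the
    sampled group spans distinct phrasings rather than near-duplicates."""
    chosen = [rewrites[0]]
    rest = rewrites[1:]
    while len(chosen) < k and rest:
        best = max(
            rest,
            key=lambda c: min(1.0 - cosine(c["emb"], s["emb"]) for s in chosen),
        )
        chosen.append(best)
        rest = [c for c in rest if c is not best]
    return chosen

# Toy pool: 'b' nearly duplicates 'a'; 'c' is a genuinely different rewrite.
pool = [
    {"id": "a", "emb": [1.0, 0.0]},
    {"id": "b", "emb": [0.99, 0.14]},
    {"id": "c", "emb": [0.0, 1.0]},
]
picked = diverse_subset(pool, 2)
print([c["id"] for c in picked])  # → ['a', 'c']
```

Greedy max-min is a standard heuristic for diversity selection; it skips the near-duplicate `b` in favor of `c`, which is the behavior the augmentation needs to avoid uniform reasoning paths.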
Ziyang Song
Centre for Artificial Intelligence and Robotics (CAIR), Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences
Zelin Zang
Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences
Zuyao Chen
Centre for Artificial Intelligence and Robotics (CAIR), Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences
Xusheng Liang
Centre for Artificial Intelligence and Robotics (CAIR), Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences
Dong Yi
Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences
Jinlin Wu
Institute of Automation, Chinese Academy of Sciences
Hongbin Liu
Centre for Artificial Intelligence and Robotics (CAIR), Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences
Jiebo Luo
Centre for Artificial Intelligence and Robotics (CAIR), Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences