🤖 AI Summary
Medical multimodal large language models (MLLMs) exhibit weak anatomical reasoning and inconsistent clinical answers on surgical anatomy images, a gap attributed to data complexity, annotation scarcity, and limitations of existing GRPO training, including insufficient knowledge sharing across anatomical structures and premature collapse onto a single reasoning path. To address these issues, we propose two innovations within the GRPO framework: (1) anatomy-aware curriculum learning, which dynamically adjusts question difficulty based on the semantic similarity among answer options; and (2) group-diverse question-answering augmentation, which enriches reasoning paths for challenging queries via multi-perspective rewriting and diversity-driven sampling. Our approach integrates semantic modeling with curriculum-based difficulty scheduling. Evaluated on the SGG-VQA and OmniMedVQA benchmarks, it achieves significant performance gains, demonstrating improved anatomical reasoning fidelity and strong generalization across diverse medical multimodal reasoning tasks.
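The curriculum idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the character-bigram "embedding" stands in for whatever semantic encoder the authors use, and all function names (`difficulty`, `curriculum_order`) are hypothetical.

```python
# Sketch: score a question's difficulty by how semantically similar its answer
# options are (similar options = harder), then order questions easy-first.
# The bigram vector is a toy stand-in for a real semantic encoder.
from collections import Counter
from itertools import combinations
import math

def embed(text: str) -> Counter:
    """Toy embedding: character-bigram counts (stand-in for a semantic model)."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def difficulty(options: list[str]) -> float:
    """Mean pairwise similarity of answer options."""
    pairs = list(combinations(options, 2))
    return sum(cosine(embed(x), embed(y)) for x, y in pairs) / len(pairs)

def curriculum_order(questions: list[dict]) -> list[dict]:
    """Easy-first schedule: ascending option similarity."""
    return sorted(questions, key=lambda q: difficulty(q["options"]))

qs = [
    {"q": "Which artery is this?",
     "options": ["hepatic artery", "splenic artery", "gastric artery"]},
    {"q": "Which organ is shown?",
     "options": ["liver", "kidney", "spleen"]},
]
ordered = curriculum_order(qs)  # organ question first: its options barely overlap
```

Questions with near-synonymous options (all "...artery") score high and are deferred to later training stages, matching the progressive schedule the summary describes.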
📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved impressive progress in natural image reasoning, yet their potential in medical imaging remains underexplored, especially for clinical surgical anatomy images. Anatomy understanding tasks demand fine-grained visual understanding and clinically coherent answers, which are difficult to achieve given the complexity of medical data and the scarcity of high-quality expert annotations. These challenges limit the effectiveness of conventional Supervised Fine-Tuning (SFT) strategies. While recent work has demonstrated that Group Relative Policy Optimization (GRPO) can enhance reasoning in MLLMs without relying on large amounts of data, we identify two weaknesses that hinder GRPO's reasoning performance in anatomy recognition: 1) knowledge cannot be effectively shared between different anatomical structures, resulting in uneven information gain and preventing the model from converging, and 2) the model quickly converges to a single reasoning path, suppressing the exploration of diverse strategies. To overcome these challenges, we propose two novel methods. First, we introduce a progressive learning strategy, Anatomical Similarity Curriculum Learning, which controls question difficulty via the semantic similarity of answer choices, enabling the model to master complex problems incrementally. Second, we apply Group Diversity Question Augmentation, which expands the model's search space for difficult queries, mitigating its tendency to produce uniform responses. Comprehensive experiments on the SGG-VQA and OmniMedVQA benchmarks show that our method achieves significant improvements on both benchmarks, demonstrating its effectiveness in enhancing the medical reasoning capabilities of MLLMs. The code can be found at https://github.com/tomato996/Anatomy-R1
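The second method's diversity-driven sampling step can be illustrated with a greedy max-min selection over candidate rewrites. This is a hedged sketch under stated assumptions: the paper rewrites hard questions with a model, whereas here the candidate pool is given as input, the bigram vectors stand in for real embeddings, and `diverse_sample` is an illustrative name, not the authors' API.

```python
# Sketch: from a pool of multi-perspective rewrites of one hard question,
# greedily keep a subset whose members are maximally dissimilar, so the
# policy sees varied phrasings instead of near-duplicates.
from collections import Counter
import math

def bigram_vec(text: str) -> Counter:
    """Toy embedding: character-bigram counts (stand-in for a real encoder)."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def diverse_sample(candidates: list[str], k: int) -> list[str]:
    """Greedy max-min selection: repeatedly add the candidate whose maximum
    similarity to the already-chosen set is smallest."""
    vecs = [bigram_vec(c) for c in candidates]
    chosen = [0]  # seed with the first candidate
    while len(chosen) < min(k, len(candidates)):
        best, best_score = None, float("inf")
        for i in range(len(candidates)):
            if i in chosen:
                continue
            score = max(cosine(vecs[i], vecs[j]) for j in chosen)
            if score < best_score:
                best, best_score = i, score
        chosen.append(best)
    return [candidates[i] for i in chosen]

pool = [
    "identify the organ in the image",
    "identify the organ shown in the image",   # near-duplicate of the first
    "what anatomical structure appears here",  # genuinely different phrasing
]
picked = diverse_sample(pool, k=2)
```

Max-min selection is a standard diversity heuristic; the near-duplicate rewrite is skipped in favor of the distinct phrasing, which is the behavior the abstract attributes to diversity-driven sampling.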