🤖 AI Summary
This study addresses the difficulty of assessing chronic wound infection from photographs, where visual appearance varies with etiology, anatomical location, and imaging conditions, and where existing image-based methods offer limited interpretability. The authors present Infection-Reasoner, a compact 4B-parameter reasoning vision-language model trained in two stages. First, chain-of-thought rationales generated by GPT-5.1 for unlabeled wound images are distilled into a student model, Qwen3-VL-4B-Thinking. Second, Group Relative Policy Optimization, a reinforcement learning method, is applied on a small labeled infection dataset to refine classification reasoning, so no expert-written reasoning annotations are required. On a held-out heterogeneous wound dataset, the model achieves 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity, outperforming several strong baselines, including GPT-5.1. Across four multimodal LLM judges, visual-support agreement scores range from 0.722 to 0.903, and wound expert review rated 61.8% of rationales as Correct and 32.4% as Partially Correct (94.2% combined).
📝 Abstract
Assessing chronic wound infection from photographs is challenging because visual appearance varies across wound etiologies, anatomical locations, and imaging conditions. Prior image-based deep learning methods have mainly focused on classification with limited interpretability, despite the need for evidence-grounded explanations to support point-of-care decision making. We present Infection-Reasoner, a compact 4B-parameter reasoning vision-language model for chronic wound infection classification and rationale generation. To address the scarcity of expert-labeled wound images with reasoning annotations, Infection-Reasoner is trained using a two-stage pipeline: (1) reasoning distillation, in which GPT-5.1 generates chain-of-thought rationales for unlabeled wound images to initialize wound-specific reasoning in a smaller student model (Qwen3-VL-4B-Thinking), and (2) reinforcement learning post-training with Group Relative Policy Optimization on a small labeled infection dataset to refine classification reasoning. On a held-out heterogeneous wound dataset, Infection-Reasoner achieved 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity, outperforming several strong baselines, including GPT-5.1. Rationale quality was further evaluated using both multimodal large language model (MLLM) judges and wound expert review. Across four MLLM judges, visual-support agreement scores ranged from 0.722 to 0.903, while expert review rated 61.8% of rationales as Correct and 32.4% as Partially Correct.
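The second training stage relies on Group Relative Policy Optimization, whose core idea is to score each sampled response relative to the other responses drawn for the same input. The following is a minimal sketch of that group-relative advantage step only, not the authors' implementation. The abstract does not describe the reward design, so the binary label-match reward, the function names, and the sample answers below are all hypothetical.

```python
from statistics import mean, pstdev


def label_reward(predicted: str, gold: str) -> float:
    """Hypothetical reward: 1.0 when the predicted label matches the gold label."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of responses sampled for the same input.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat their siblings get a positive signal and the rest a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers for a single wound image
# whose gold label is "infected".
samples = ["infected", "not infected", "infected", "infected"]
rewards = [label_reward(s, "infected") for s in samples]
advantages = group_relative_advantages(rewards)

for answer, adv in zip(samples, advantages):
    print(f"{answer:>12}: advantage {adv:+.3f}")
```

Because advantages are computed within each group, a group in which every response earns the same reward yields all-zero advantages and contributes no learning signal. In a full training loop these advantages would weight the policy-gradient update for each response's tokens; that part is omitted here.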