🤖 AI Summary
Problem: Current large language models (LLMs) lack systematic, transparent, and verifiable reasoning capabilities in medical applications, hindering clinical deployment.
Method: We systematically review 60 key works from 2022–2025 and propose the first taxonomy of LLM-enhancement techniques for medical reasoning, categorizing methods into training-time (e.g., supervised fine-tuning, reinforcement learning) and test-time (e.g., prompt engineering, multi-agent collaboration) approaches, and covering multimodal inputs including text, medical imaging, and code.
Contribution/Results: Our analysis identifies core challenges, notably the "faithfulness-plausibility gap," and advocates shifting evaluation paradigms from accuracy-centric metrics toward reasoning quality, process interpretability, and formal verifiability. The framework advances native multimodal medical reasoning and provides a theoretical foundation and practical roadmap for developing clinically trustworthy AI systems.
📝 Abstract
The proliferation of Large Language Models (LLMs) in medicine has enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning, a cornerstone of clinical practice. This has catalyzed a shift from single-step answer generation to the development of LLMs explicitly designed for medical reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies (e.g., supervised fine-tuning, reinforcement learning) and test-time mechanisms (e.g., prompt engineering, multi-agent systems). We analyze how these techniques are applied across different data modalities (text, image, code) and in key clinical applications such as diagnosis, education, and treatment planning. Furthermore, we survey the evolution of evaluation benchmarks from simple accuracy metrics to sophisticated assessments of reasoning quality and visual interpretability. Based on an analysis of 60 seminal studies from 2022–2025, we conclude by identifying critical challenges, including the faithfulness-plausibility gap and the need for native multimodal reasoning, and by outlining future directions toward building efficient, robust, and sociotechnically responsible medical AI.