🤖 AI Summary
Problem: Current large language models (LLMs) lack systematic, transparent, and verifiable reasoning capabilities in medical applications, hindering clinical deployment.
Method: We systematically review 60 key works from 2022–2025 and propose the first taxonomy of LLM-enhancement techniques for medical reasoning, categorizing methods into training-time (e.g., supervised fine-tuning, reinforcement learning) and test-time (e.g., prompt engineering, multi-agent collaboration) approaches, and covering multimodal inputs including text, medical imaging, and code.
Contribution/Results: Our analysis identifies core challenges, notably the "faithfulness-plausibility gap," and advocates shifting evaluation paradigms from accuracy-centric metrics toward reasoning quality, process interpretability, and formal verifiability. The framework advances native multimodal medical reasoning and provides a theoretical foundation and practical roadmap for developing clinically trustworthy AI systems.
📝 Abstract
The proliferation of Large Language Models (LLMs) in medicine has enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning, a cornerstone of clinical practice. This has catalyzed a shift from single-step answer generation to the development of LLMs explicitly designed for medical reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies (e.g., supervised fine-tuning, reinforcement learning) and test-time mechanisms (e.g., prompt engineering, multi-agent systems). We analyze how these techniques are applied across different data modalities (text, image, code) and in key clinical applications such as diagnosis, education, and treatment planning. Furthermore, we survey the evolution of evaluation benchmarks from simple accuracy metrics to sophisticated assessments of reasoning quality and visual interpretability. Based on an analysis of 60 seminal studies from 2022–2025, we conclude by identifying critical challenges, including the faithfulness-plausibility gap and the need for native multimodal reasoning, and by outlining future directions toward building efficient, robust, and sociotechnically responsible medical AI.