🤖 AI Summary
Existing medical LLM evaluation benchmarks (e.g., MedQA-USMLE) conflate reasoning capability with factual memorization, hindering accurate assessment of clinical diagnostic reasoning.
Method: This work is the first to systematically disentangle the reasoning and knowledge components of medical question answering, constructing a dual-track benchmark whose items are classified via PubMedBERT (81% accuracy) to isolate reasoning-demanding questions; only 32.8% of items require complex reasoning. The authors propose a training paradigm built on adversarial misdirection detection and backtracking reasoning, combining supervised fine-tuning and reinforcement learning (RL) on clinical case–enhanced data to train BioMed-R1.
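The dual-track split routes each QA item into a reasoning or knowledge subset. The paper's actual classifier is a fine-tuned PubMedBERT model (81% accuracy); the keyword heuristic below is only a hypothetical stand-in to illustrate the shape of the splitting pipeline, with cue phrases and sample items invented for the sketch.

```python
# Hypothetical stand-in for the paper's PubMedBERT classifier:
# route QA items into "reasoning" vs. "knowledge" subsets.
# Cue phrases here are illustrative, not from the paper.
REASONING_CUES = (
    "most likely diagnosis",
    "next best step",
    "presents with",
)

def route(question: str) -> str:
    """Return 'reasoning' for multi-step clinical-vignette-style items,
    'knowledge' for pure factual-recall items."""
    q = question.lower()
    return "reasoning" if any(cue in q for cue in REASONING_CUES) else "knowledge"

items = [
    "A 45-year-old man presents with chest pain radiating to the jaw. "
    "What is the most likely diagnosis?",
    "Which enzyme is deficient in phenylketonuria?",
]
subsets = {"reasoning": [], "knowledge": []}
for q in items:
    subsets[route(q)].append(q)

print(len(subsets["reasoning"]), len(subsets["knowledge"]))  # 1 1
```

In the paper the routing decision comes from a trained classifier rather than surface cues, which is what allows the split to approach human agreement.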
Contribution/Results: BioMed-R1 achieves state-of-the-art reasoning performance among models of comparable scale. Among the evaluated baselines, the biomedical model m1 exhibits a 13.4-point gap between its knowledge (60.5) and reasoning (47.1) scores, underscoring that reasoning lags factual recall. RL training significantly improves reasoning accuracy, and applying this paradigm to general-purpose LLMs markedly enhances their robustness to adversarial misdirection.
📝 Abstract
Medical reasoning in large language models (LLMs) aims to emulate clinicians' diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix reasoning with factual recall. We address this by separating 11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets using a PubMedBERT classifier that reaches 81 percent accuracy, comparable to human performance. Our analysis shows that only 32.8 percent of questions require complex reasoning. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), finding consistent gaps between knowledge and reasoning performance. For example, m1 scores 60.5 on knowledge but only 47.1 on reasoning. In adversarial tests where models are misled with incorrect initial reasoning, biomedical models degrade sharply, while larger or RL-trained general models show more robustness. To address this, we train BioMed-R1 using fine-tuning and reinforcement learning on reasoning-heavy examples. It achieves the strongest performance among similarly sized models. Further gains may come from incorporating clinical case reports and training with adversarial and backtracking scenarios.
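The adversarial probe described in the abstract seeds the model with an incorrect initial reasoning step and checks whether it backtracks. A minimal sketch of constructing such a prompt, assuming a simple template (the wording here is hypothetical, not the paper's exact prompt format):

```python
# Sketch of the adversarial-misdirection probe: prepend a
# plausible-but-wrong first reasoning step, then ask the model to
# continue and self-correct. Template wording is an assumption.
def build_misdirected_prompt(question: str, wrong_lead: str) -> str:
    """Build a prompt that misleads the model with an incorrect
    initial reasoning step before asking for the final answer."""
    return (
        f"Question: {question}\n"
        f"Initial reasoning: {wrong_lead}\n"
        "Continue the reasoning, correcting any earlier mistakes, "
        "then state the final answer."
    )

prompt = build_misdirected_prompt(
    "A patient has polyuria, polydipsia, and elevated serum glucose. Diagnosis?",
    "The symptoms point to diabetes insipidus.",  # deliberately wrong lead
)
print("diabetes insipidus" in prompt)  # True
```

Robustness is then measured by whether accuracy under such misdirected prompts degrades relative to the clean prompts; per the abstract, biomedical models degrade sharply while larger or RL-trained general models hold up better.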