Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the interpretability and quality of reasoning traces in advanced large language models (e.g., OpenAI-o1/3, DeepSeek-R1) on medical and mathematical tasks. Method: the authors propose a knowledge–reasoning disentanglement paradigm that decomposes chain-of-thought reasoning into fine-grained knowledge-retrieval and reasoning operations, and introduce two quantitative metrics, Knowledge Index (KI) and Information Gain (InfoGain), to assess each independently. Contribution/Results: experiments reveal that supervised fine-tuning (SFT) improves final-answer accuracy but degrades reasoning quality, reducing InfoGain by 38.9% on average; reinforcement learning (RL) enhances medical reasoning by pruning erroneous knowledge, improving both KI and InfoGain; and R1-distilled models fail to generalize across domains. The study provides the first systematic analysis of the internal mechanisms governing LLM reasoning quality and establishes an interpretable, metric-driven evaluation framework for trustworthy AI.

📝 Abstract
Recent advances in reasoning-enhanced Large Language Models such as OpenAI-o1/3 and DeepSeek-R1 have significantly improved performance on complex tasks. However, the quality and transparency of their internal reasoning processes remain underexplored. This work moves beyond final-answer accuracy and investigates step-by-step reasoning in the medical and mathematical domains by explicitly decomposing thinking trajectories into two parts: knowledge and reasoning. Specifically, we introduce a fine-grained evaluation framework that judges (1) the correctness of the knowledge used, measured by the Knowledge Index (KI), and (2) the quality of reasoning, measured by Information Gain (InfoGain). Using this framework, we study R1-distilled and base Qwen models trained with supervised fine-tuning (SFT) and/or reinforcement learning (RL) in the medical and math domains. Three intriguing findings emerge: (1) The general reasoning abilities of R1-distilled models do not transfer effectively to the medical domain through either SFT or RL. (2) SFT raises final-answer accuracy in both domains, but often at the cost of reasoning quality: InfoGain drops by 38.9% on average compared with untrained models; in the medical domain, however, SFT remains crucial because domain knowledge is indispensable. (3) RL enhances medical reasoning by pruning inaccurate or irrelevant knowledge from reasoning paths, thereby improving both reasoning accuracy and knowledge correctness.
Problem

Research questions and friction points this paper is trying to address.

Evaluating knowledge correctness and reasoning quality in LLMs
Assessing reasoning transferability across medical and math domains
Analyzing SFT and RL impacts on reasoning accuracy and knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes thinking trajectories into knowledge and reasoning components
Introduces KI and InfoGain evaluation framework
Studies SFT and RL effects on reasoning quality
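The paper's exact metric definitions are not reproduced on this page, but the decomposition idea can be sketched in code. Below is a minimal, illustrative Python sketch assuming that each trace step is labeled as a knowledge or reasoning step by an external judge, that KI is the fraction of knowledge steps the judge marks correct, and that InfoGain is approximated as the average per-step increase in the model's log-probability of the gold answer. The `Step` fields, the judge labels, and the log-probability proxy are all assumptions for illustration, not the authors' implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class Step:
    text: str               # content of this chain-of-thought step
    kind: str               # "knowledge" or "reasoning" (assumed judge label)
    correct: bool           # judge's verdict; meaningful for knowledge steps
    p_answer_before: float  # model's prob. of the gold answer before the step
    p_answer_after: float   # model's prob. of the gold answer after the step

def knowledge_index(steps: list[Step]) -> float:
    """Fraction of knowledge steps judged correct (1.0 if none present)."""
    knowledge = [s for s in steps if s.kind == "knowledge"]
    if not knowledge:
        return 1.0
    return sum(s.correct for s in knowledge) / len(knowledge)

def info_gain(steps: list[Step]) -> float:
    """Average per-step gain in log-probability of the gold answer."""
    if not steps:
        return 0.0
    gains = [math.log(s.p_answer_after) - math.log(s.p_answer_before)
             for s in steps]
    return sum(gains) / len(gains)

trace = [
    Step("Aspirin inhibits COX enzymes.", "knowledge", True, 0.2, 0.4),
    Step("Therefore it reduces prostaglandin synthesis.", "reasoning", True, 0.4, 0.8),
]
print(knowledge_index(trace))  # 1.0: the single knowledge step is correct
print(info_gain(trace))        # ~0.693: each step doubles the answer probability
```

Under this proxy, SFT lowering InfoGain would correspond to traces whose steps move the model toward the answer less per step, even when the final answer is right.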