From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies “Tool-Induced Myopia” (TIM) in tool-augmented large language models, particularly those equipped with code interpreters: models invoke tools correctly and produce accurate final answers, yet substitute external tool outputs for internal logical reasoning, fragmenting the reasoning chain and degrading interpretability. Method: the authors formally define TIM, introduce the PYMATH benchmark of competition-level mathematical problems together with a multi-dimensional evaluation suite, and empirically establish a positive correlation between tool-invocation frequency and reasoning degradation. To mitigate TIM, they propose a preference-optimization-based training framework that realigns models to treat tool outputs as assistive evidence rather than substitutes for reasoning. Contribution/Results: tool use yields up to a 19.3 percentage point gain in final-answer accuracy while reasoning behavior consistently deteriorates; TIM appears in roughly 55% of high-risk cases; and the proposed framework improves both final-answer accuracy and reasoning depth under tool use.

📝 Abstract
Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool use; the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity); with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Codes and data are available at: https://github.com/megagonlabs/TIM.
Problem

Research questions and friction points this paper is trying to address.

Tool-augmented LLMs treat tool outputs as reasoning substitutes
Tool use causes reasoning degradation despite accuracy gains
Tool-Induced Myopia leads to unjustified but seemingly correct solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed PYMATH benchmark with 1,679 competition-level math problems
Created multi-dimensional evaluation suite for reasoning degradation
Proposed preference-optimization framework that realigns models to use tools as assistive evidence
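The paper does not spell out its preference-optimization objective here, but a common instantiation of preference optimization is a DPO-style pairwise loss, where a coherent-reasoning solution trace is preferred over a tool-myopic one. A minimal single-pair sketch under that assumption (function name and inputs are illustrative):

```python
import math

def dpo_pair_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style loss for one preference pair.

    pi_* / ref_* are log-probabilities of the chosen (coherent-reasoning)
    and rejected (tool-myopic) traces under the policy and a frozen
    reference model. Minimizing this pushes the policy toward the
    chosen trace relative to the reference.
    """
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(logits)), written in a numerically stable form
    return math.log1p(math.exp(-logits))

# Example: the policy already favors the coherent trace, so the loss is small;
# swapping chosen/rejected yields a larger loss.
low = dpo_pair_loss(pi_chosen=-10.0, pi_rejected=-14.0,
                    ref_chosen=-12.0, ref_rejected=-12.0)
high = dpo_pair_loss(pi_chosen=-14.0, pi_rejected=-10.0,
                     ref_chosen=-12.0, ref_rejected=-12.0)
print(low < high)
```

In this setting, preference pairs would be built from model solutions judged by the evaluation suite: traces that justify each step internally as "chosen", traces that lean on tool output without justification as "rejected".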