When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

📅 2026-04-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the “tool-ignored” problem, in which existing tool-integrated reasoning models erroneously disregard correct tool outputs that conflict with the model’s internal reasoning. To mitigate this issue, the paper proposes an Adaptive Tool Trust Calibration (ATTC) framework that introduces, for the first time, a dynamic trust mechanism based on code generation confidence. ATTC jointly optimizes confidence estimation, code generation, and tool invocation to determine whether to accept tool feedback. Experimental results across multiple open-source language models and mathematical reasoning benchmarks show that ATTC alleviates the tool-ignored problem, yielding consistent performance gains of 4.1%–7.5%.
๐Ÿ“ Abstract
Large reasoning models (LRMs) have achieved strong performance gains by scaling test-time computation, but because of the inherent limitations of the underlying language models, they still fall short on tasks that require precise computation or extensive knowledge. Tool-Integrated Reasoning (TIR) has emerged as a promising paradigm that incorporates tool calls and their execution into the reasoning trajectory. Although recent work has released several powerful open-source TIR models, our analysis reveals that these models still suffer from critical deficiencies. When the model's reasoning conflicts with the tool results, the model tends to trust its own reasoning. In some cases the tool results are correct but the model ignores them and produces an incorrect answer, a failure we define as "Tool Ignored". This indicates that the model does not know when to trust or ignore the tool. To overcome these limitations, we introduce Adaptive Tool Trust Calibration (ATTC), a novel framework that guides the model to adaptively trust or ignore tool results based on the confidence score of the generated code blocks. Experimental results on open-source TIR models of different sizes and across multiple datasets demonstrate that ATTC effectively reduces the "Tool Ignored" issue, yielding a performance increase of 4.1% to 7.5%.
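The abstract's core idea, trusting or ignoring a tool result according to the confidence of the generated code block, can be illustrated with a minimal sketch. This is not the paper's implementation: the confidence metric (geometric-mean token probability), the threshold value, and the names `code_confidence`, `should_trust_tool`, and `is_tool_ignored` are all assumptions introduced here for illustration.

```python
import math
from typing import Sequence


def code_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of a generated code block.

    Assumption: the summary does not say how the confidence score is
    computed, so the mean token log-probability is used as a stand-in.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_trust_tool(token_logprobs: Sequence[float], threshold: float = 0.9) -> bool:
    """Trust the tool output when code confidence reaches the threshold.

    The threshold of 0.9 is a hypothetical value, not one taken from the paper.
    """
    return code_confidence(token_logprobs) >= threshold


def is_tool_ignored(tool_result: str, final_answer: str, reference: str) -> bool:
    """Flag a "Tool Ignored" case: the tool was correct but the final answer is not."""
    return tool_result == reference and final_answer != reference
```

With this sketch, a code block whose tokens were all generated with near-certain probability leads to the tool output being accepted, while a low-confidence block leaves the model free to fall back on its own reasoning. `is_tool_ignored` mirrors the abstract's definition and could be used to count such cases in an evaluation log.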
Problem

Research questions and friction points this paper is trying to address.

Tool-Integrated Reasoning
Trust Calibration
Math Reasoning
Large Reasoning Models
Tool Ignored
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Tool Trust Calibration
Tool-Integrated Reasoning
Tool Ignored
Confidence-based Calibration
Mathematical Reasoning