🤖 AI Summary
Existing tool-augmented large language models (LLMs) are constrained by a single-tool-per-step invocation paradigm, which limits their capability on multi-step, high-precision mathematical reasoning.
Method: We propose Multi-TAG, a zero-shot, inference-time framework that concurrently invokes multiple external tools at each reasoning step. It employs output verification, iterative reasoning correction, and integrated result aggregation for robust decision-making; it requires no model fine-tuning and is fully compatible with both open- and closed-source LLMs.
Contribution/Results: Multi-TAG departs from conventional single-tool-selection mechanisms, instead emphasizing collaborative tool verification and fusion. Evaluated on four challenging benchmarks (MATH500, AIME, AMC, and OlympiadBench), it outperforms state-of-the-art methods by an average of 6.0% to 7.5% in accuracy, improving both solution quality and generalization on complex mathematical problems.
📝 Abstract
Augmenting large language models (LLMs) with external tools is a promising avenue for developing high-performance mathematical reasoning systems. Prior tool-augmented approaches typically finetune an LLM to select and invoke a single tool at each reasoning step, and show promising results on simpler math reasoning benchmarks such as GSM8K. However, these approaches struggle with more complex math problems that require precise reasoning over multiple steps. To address this limitation, we propose Multi-TAG, a Multi-Tool AGgregation-based framework. Instead of relying on a single tool, Multi-TAG guides an LLM to concurrently invoke multiple tools at each reasoning step. It then aggregates their diverse outputs to verify and refine the reasoning process, enhancing solution robustness and accuracy. Notably, Multi-TAG is a finetuning-free, inference-only framework, making it readily applicable to any LLM backbone, including large open-weight models, which are computationally expensive to finetune, and proprietary frontier models, which cannot be finetuned with custom recipes. We evaluate Multi-TAG on four challenging benchmarks: MATH500, AIME, AMC, and OlympiadBench. Across both open-weight and closed-source LLM backbones, Multi-TAG consistently and substantially outperforms state-of-the-art baselines, achieving average improvements of 6.0% to 7.5%.
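The per-step mechanism described above (invoke several tools on the same intermediate step, then aggregate their outputs to verify the step) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the three "tools" here are toy arithmetic evaluators standing in for real external tools such as a code interpreter or symbolic solver, and majority voting stands in for the paper's aggregation strategy.

```python
from collections import Counter

# Toy stand-ins for external tools; the paper's actual tools differ.
def tool_numeric_eval(expr):
    # Demo only: eval() on untrusted input is unsafe in real systems.
    return eval(expr)

def tool_rounded_eval(expr):
    # A second, independent evaluator of the same expression.
    return round(eval(expr), 6)

def tool_addition_parser(expr):
    # A third "tool": a naive parser that only handles sums.
    return sum(float(t) for t in expr.split("+"))

TOOLS = [tool_numeric_eval, tool_rounded_eval, tool_addition_parser]

def multi_tool_step(expr):
    """Invoke every tool on the same intermediate expression and
    aggregate outputs by majority vote; cross-tool agreement acts as
    a lightweight verification signal for the reasoning step."""
    outputs = []
    for tool in TOOLS:
        try:
            outputs.append(tool(expr))
        except Exception:
            pass  # a failed tool simply contributes no vote
    value, votes = Counter(outputs).most_common(1)[0]
    return value, votes / len(TOOLS)  # aggregated result + agreement ratio

result, agreement = multi_tool_step("1+2+3")
print(result, agreement)  # prints: 6 1.0
```

In a full pipeline, a high agreement ratio would let the LLM accept the step and continue, while low agreement would trigger the iterative correction the abstract describes (re-deriving the step before proceeding).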