🤖 AI Summary
Existing tool-augmented large language models (LLMs) are constrained by a single-tool-per-step invocation paradigm, which limits their capability on multi-step, high-precision mathematical reasoning.
Method: We propose Multi-TAG, a zero-shot, inference-time framework that concurrently invokes multiple external tools at each reasoning step. It employs output verification, iterative reasoning correction, and integrated result aggregation for robust decision-making; it requires no model fine-tuning and is fully compatible with both open- and closed-source LLMs.
Contribution/Results: Multi-TAG departs from conventional single-tool-selection mechanisms, instead emphasizing collaborative tool verification and fusion. Evaluated on four challenging benchmarks (MATH500, AIME, AMC, and OlympiadBench), it outperforms state-of-the-art methods by an average of 6.0% to 7.5% in accuracy, improving both solution quality and generalization on complex mathematical problems.
📝 Abstract
Augmenting large language models (LLMs) with external tools is a promising avenue for developing high-performance mathematical reasoning systems. Prior tool-augmented approaches typically finetune an LLM to select and invoke a single tool at each reasoning step, and show promising results on simpler math reasoning benchmarks such as GSM8K. However, these approaches struggle with more complex math problems that require precise reasoning over multiple steps. To address this limitation, we propose Multi-TAG, a Multi-Tool AGgregation-based framework. Instead of relying on a single tool, Multi-TAG guides an LLM to concurrently invoke multiple tools at each reasoning step. It then aggregates their diverse outputs to verify and refine the reasoning process, enhancing solution robustness and accuracy. Notably, Multi-TAG is a finetuning-free, inference-only framework, making it readily applicable to any LLM backbone, including large open-weight models, which are computationally expensive to finetune, and proprietary frontier models, which cannot be finetuned with custom recipes. We evaluate Multi-TAG on four challenging benchmarks: MATH500, AIME, AMC, and OlympiadBench. Across both open-weight and closed-source LLM backbones, Multi-TAG consistently and substantially outperforms state-of-the-art baselines, achieving average improvements of 6.0% to 7.5%.
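The per-step mechanism described above (invoke several tools on the same intermediate step, then aggregate their outputs to verify the step) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the three "tools" here are toy arithmetic evaluators standing in for real external tools such as a code interpreter or symbolic solver, and majority voting stands in for the paper's aggregation strategy.

```python
from collections import Counter

# Toy stand-ins for external tools; the paper's actual tools differ.
def tool_numeric_eval(expr):
    # Demo only: eval() on untrusted input is unsafe in real systems.
    return eval(expr)

def tool_rounded_eval(expr):
    # A second, independent evaluator of the same expression.
    return round(eval(expr), 6)

def tool_addition_parser(expr):
    # A third "tool": a naive parser that only handles sums.
    return sum(float(t) for t in expr.split("+"))

TOOLS = [tool_numeric_eval, tool_rounded_eval, tool_addition_parser]

def multi_tool_step(expr):
    """Invoke every tool on the same intermediate expression and
    aggregate outputs by majority vote; cross-tool agreement acts as
    a lightweight verification signal for the reasoning step."""
    outputs = []
    for tool in TOOLS:
        try:
            outputs.append(tool(expr))
        except Exception:
            pass  # a failed tool simply contributes no vote
    value, votes = Counter(outputs).most_common(1)[0]
    return value, votes / len(TOOLS)  # aggregated result + agreement ratio

result, agreement = multi_tool_step("1+2+3")
print(result, agreement)  # prints: 6 1.0
```

In a full pipeline, a high agreement ratio would let the LLM accept the step and continue, while low agreement would trigger the iterative correction the abstract describes (re-deriving the step before proceeding).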