THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) still show limited accuracy in numerical computation and symbolic manipulation during mathematical reasoning. To address this, the authors propose THOR, a tool-augmented fine-grained reasoning framework. First, they introduce TIRGen, a multi-agent actor-critic pipeline for automatically generating high-quality tool-integrated reasoning data. Second, they design a hierarchical reinforcement learning strategy that jointly optimizes trajectory-level problem solving and step-level code generation. Third, they incorporate a dynamic self-correction mechanism grounded in tool execution feedback, enabling real-time refinement of erroneous reasoning steps during inference. The approach achieves state-of-the-art performance among models of comparable scale on multiple mathematical benchmarks, delivers consistent gains on code benchmarks, and generalizes well across diverse LLM architectures. This work establishes a scalable, optimization-friendly paradigm for tool-augmented reasoning.

📝 Abstract
Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but continue to struggle with high-precision tasks such as numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods face three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent actor-critic-based pipeline for constructing high-quality datasets of tool-integrated reasoning paths that align with the policy and generalize well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes both trajectory-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer's correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach generalizes strongly across diverse models, performing effectively in both reasoning and non-reasoning models. It further achieves state-of-the-art performance among models of similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.
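The abstract's key insight — that a successful intermediate tool call strongly predicts final-answer correctness — suggests rewarding both levels of the hierarchy. The paper's implementation is not shown here; the following is a minimal, hypothetical sketch of such a two-level reward, with illustrative names (`Step`, `Trajectory`, `hierarchical_rewards`, `w_step`) that are not from the authors' code:

```python
# Hypothetical sketch of hierarchical reward shaping: a trajectory-level
# reward for the final answer plus step-level rewards for each tool call.
from dataclasses import dataclass, field

@dataclass
class Step:
    code: str
    executed_ok: bool  # did the tool (code interpreter) call succeed?

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    final_answer: str = ""

def hierarchical_rewards(traj: Trajectory, gold_answer: str,
                         w_step: float = 0.1) -> tuple[float, list[float]]:
    """Return (trajectory-level reward, per-step rewards).

    Trajectory reward: 1.0 if the final answer matches the reference, else 0.0.
    Step reward: +w_step for a tool call that executed successfully, -w_step
    otherwise, reflecting the insight that successful intermediate tool calls
    predict final-answer correctness.
    """
    traj_reward = 1.0 if traj.final_answer.strip() == gold_answer.strip() else 0.0
    step_rewards = [w_step if s.executed_ok else -w_step for s in traj.steps]
    return traj_reward, step_rewards
```

In an RL loop, the trajectory reward would drive problem-solving behavior while the step rewards give a denser signal for code generation; the weighting between the two levels is an open design choice.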
Problem

Research questions and friction points this paper is trying to address.

Constructing high-quality tool-integrated reasoning data
Performing fine-grained, hierarchical optimization of tool use via RL
Enhancing inference by correcting errors with tool execution feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent actor-critic pipeline for tool-integrated dataset generation
Hierarchical RL jointly optimizing trajectory-level problem solving and step-level code generation
Self-correction mechanism using tool feedback during inference
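The self-correction idea above — executing each generated code step and, on failure, feeding the error back to the model to revise that step — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; `generate_step` is a hypothetical stand-in for the LLM:

```python
# Illustrative sketch of inference-time self-correction driven by tool
# execution feedback (names and structure are assumptions, not THOR's code).
import contextlib
import io
import traceback

def run_code(code: str) -> tuple[bool, str]:
    """Execute a code snippet; return (success, captured output or error text)."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return True, buf.getvalue()
    except Exception:
        return False, traceback.format_exc(limit=1)

def solve_with_self_correction(generate_step, max_retries: int = 2) -> str:
    """generate_step(feedback) stands in for the LLM: it proposes a code step,
    optionally conditioned on the previous execution's error message."""
    feedback = None
    for _ in range(max_retries + 1):
        code = generate_step(feedback)
        ok, out = run_code(code)
        if ok:
            return out  # tool call succeeded; continue the reasoning trace
        feedback = out  # revise the step using the tool's error message
    return ""  # give up after exhausting retries
```

A real system would thread this through a multi-step reasoning trace rather than a single call, but the loop shows the core mechanism: the tool's error message, not a learned critic, triggers the revision.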
Qikai Chang
University of Science and Technology of China
Zhenrong Zhang
iFLYTEK Research
Pengfei Hu
University of Science and Technology of China
Jiefeng Ma
University of Science and Technology of China
Yicheng Pan
University of Science and Technology of China
Jianshu Zhang
iFLYTEK Research
Jun Du
University of Science and Technology of China
Quan Liu
iFLYTEK Research
Jianqing Gao
iFLYTEK Research