THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) still show limited accuracy in numerical computation and symbolic manipulation during mathematical reasoning. To address this, the authors propose THOR, a tool-augmented fine-grained reasoning framework. First, they introduce TIRGen, a multi-agent actor-critic pipeline for automatically generating high-quality tool-integrated reasoning data. Second, they design a hierarchical reinforcement learning strategy that jointly optimizes trajectory-level problem solving and step-level code generation. Third, they incorporate a dynamic self-correction mechanism grounded in tool execution feedback, enabling real-time refinement of erroneous reasoning steps during inference. The approach achieves state-of-the-art performance among models of comparable scale on multiple mathematical benchmarks, delivers consistent gains on code benchmarks, and generalizes well across diverse LLM architectures. This work establishes a scalable, optimization-friendly paradigm for tool-augmented reasoning.

📝 Abstract
Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but continue to struggle with high-precision tasks such as numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods face three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent actor-critic-based pipeline for constructing high-quality datasets of tool-integrated reasoning paths that align with the policy and generalize well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes both trajectory-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer's correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach generalizes strongly across diverse models, performing effectively in both reasoning and non-reasoning models. It further achieves state-of-the-art performance among models of similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.
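The abstract's key insight — that a successful intermediate tool call strongly predicts final-answer correctness — suggests rewarding both levels of the hierarchy. The paper's implementation is not shown here; the following is a minimal, hypothetical sketch of such a two-level reward, with illustrative names (`Step`, `Trajectory`, `hierarchical_rewards`, `w_step`) that are not from the authors' code:

```python
# Hypothetical sketch of hierarchical reward shaping: a trajectory-level
# reward for the final answer plus step-level rewards for each tool call.
from dataclasses import dataclass, field

@dataclass
class Step:
    code: str
    executed_ok: bool  # did the tool (code interpreter) call succeed?

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    final_answer: str = ""

def hierarchical_rewards(traj: Trajectory, gold_answer: str,
                         w_step: float = 0.1) -> tuple[float, list[float]]:
    """Return (trajectory-level reward, per-step rewards).

    Trajectory reward: 1.0 if the final answer matches the reference, else 0.0.
    Step reward: +w_step for a tool call that executed successfully, -w_step
    otherwise, reflecting the insight that successful intermediate tool calls
    predict final-answer correctness.
    """
    traj_reward = 1.0 if traj.final_answer.strip() == gold_answer.strip() else 0.0
    step_rewards = [w_step if s.executed_ok else -w_step for s in traj.steps]
    return traj_reward, step_rewards
```

In an RL loop, the trajectory reward would drive problem-solving behavior while the step rewards give a denser signal for code generation; the weighting between the two levels is an open design choice.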
Problem

Research questions and friction points this paper is trying to address.

Constructing high-quality tool-integrated reasoning data
Performing fine-grained, hierarchical optimization of tool use via RL
Enhancing inference by correcting errors with tool execution feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent actor-critic pipeline for tool-integrated dataset generation
Hierarchical RL jointly optimizing trajectory-level problem solving and step-level code generation
Self-correction mechanism using tool feedback during inference
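The self-correction idea above — executing each generated code step and, on failure, feeding the error back to the model to revise that step — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; `generate_step` is a hypothetical stand-in for the LLM:

```python
# Illustrative sketch of inference-time self-correction driven by tool
# execution feedback (names and structure are assumptions, not THOR's code).
import contextlib
import io
import traceback

def run_code(code: str) -> tuple[bool, str]:
    """Execute a code snippet; return (success, captured output or error text)."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return True, buf.getvalue()
    except Exception:
        return False, traceback.format_exc(limit=1)

def solve_with_self_correction(generate_step, max_retries: int = 2) -> str:
    """generate_step(feedback) stands in for the LLM: it proposes a code step,
    optionally conditioned on the previous execution's error message."""
    feedback = None
    for _ in range(max_retries + 1):
        code = generate_step(feedback)
        ok, out = run_code(code)
        if ok:
            return out  # tool call succeeded; continue the reasoning trace
        feedback = out  # revise the step using the tool's error message
    return ""  # give up after exhausting retries
```

A real system would thread this through a multi-step reasoning trace rather than a single call, but the loop shows the core mechanism: the tool's error message, not a learned critic, triggers the revision.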
Qikai Chang
University of Science and Technology of China
Zhenrong Zhang
iFLYTEK Research
Pengfei Hu
University of Science and Technology of China
Jiefeng Ma
University of Science and Technology of China
Yicheng Pan
University of Science and Technology of China
Jianshu Zhang
iFLYTEK Research
Jun Du
University of Science and Technology of China
Quan Liu
iFLYTEK Research
Jianqing Gao
iFLYTEK Research