Guided by Trajectories: Repairing and Rewarding Tool-Use Trajectories for Tool-Integrated Reasoning

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing tool-integrated reasoning methods, which suffer from insufficient and biased supervision due to the scarcity of high-quality synthetic trajectories and sparse reward signals. To overcome these challenges, the authors propose AutoTraj, a two-stage framework that first automatically constructs high-quality tool-use trajectories through a generate-evaluate-repair mechanism, and then optimizes reasoning paths by integrating multidimensional trajectory preference modeling with reinforcement learning. The approach introduces a trajectory repair module that leverages a large language model as a repairer, and combines supervised fine-tuning with trajectory-level reward learning. Evaluated on real-world benchmarks, AutoTraj significantly enhances both the reliability and performance of tool-augmented reasoning systems.
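As a rough illustration of the generate-evaluate-repair mechanism described above, the sketch below shows how candidate trajectories could be scored, retained, or repaired, and how each repaired/original pair feeds the preference dataset. The function names (`generate`, `evaluate`, `repair`), the number of candidates, and the quality threshold are hypothetical placeholders, not interfaces from the paper.

```python
def build_sft_and_preference_data(queries, generate, evaluate, repair,
                                  k=4, threshold=0.8):
    """Sketch of generate-evaluate-repair data construction.

    generate(query) -> trajectory samples one candidate tool-use trajectory,
    evaluate(trajectory) -> float scores it in [0, 1] across quality
    dimensions, and repair(query, trajectory) -> trajectory stands in for
    the LLM-as-Repairer. All three are caller-supplied assumptions.
    """
    sft_data, preference_pairs = [], []
    for query in queries:
        for _ in range(k):                    # k candidate trajectories per query
            traj = generate(query)
            if evaluate(traj) >= threshold:   # high quality: retain directly
                sft_data.append((query, traj))
            else:                             # low quality: repair, then retain
                repaired = repair(query, traj)
                sft_data.append((query, repaired))
                # (chosen=repaired, rejected=original) preference pair
                preference_pairs.append((query, repaired, traj))
    return sft_data, preference_pairs
```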

📝 Abstract
Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to solve complex tasks by interacting with external tools, yet existing approaches depend on high-quality synthesized trajectories selected by scoring functions and sparse outcome-based rewards, providing limited and biased supervision for learning TIR. To address these challenges, in this paper, we propose AutoTraj, a two-stage framework that automatically learns TIR by repairing and rewarding tool-use trajectories. Specifically, in the supervised fine-tuning (SFT) stage, AutoTraj generates multiple candidate tool-use trajectories for each query and evaluates them along multiple dimensions. High-quality trajectories are directly retained, while low-quality ones are repaired using an LLM (i.e., LLM-as-Repairer). The resulting repaired and high-quality trajectories form a synthetic SFT dataset, while each repaired trajectory paired with its original low-quality counterpart constitutes a dataset for trajectory preference modeling. In the reinforcement learning (RL) stage, based on the preference dataset, we train a trajectory-level reward model to assess the quality of reasoning paths and combine it with outcome and format rewards, thereby explicitly guiding the optimization toward reliable TIR behaviors. Experiments on real-world benchmarks demonstrate the effectiveness of AutoTraj in TIR.
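The RL stage pairs a preference-trained, trajectory-level reward model with outcome and format rewards. Below is a minimal PyTorch sketch of what that could look like: a Bradley-Terry pairwise loss for training the reward model on (repaired, original) pairs, followed by a weighted reward mix. The `rm.score` interface, the weights, and the specific loss are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch.nn.functional as F

def preference_loss(rm, queries, chosen, rejected):
    """Bradley-Terry pairwise loss for the trajectory-level reward model.

    rm.score(queries, trajectories) -> tensor of shape (batch,) is an
    assumed interface; `chosen` holds repaired trajectories and `rejected`
    their original low-quality counterparts.
    """
    margin = rm.score(queries, chosen) - rm.score(queries, rejected)
    # Maximize the log-sigmoid of the margin so repaired outranks original.
    return -F.logsigmoid(margin).mean()

def combined_reward(r_traj, r_outcome, r_format,
                    w_traj=0.5, w_outcome=1.0, w_format=0.2):
    """Weighted mix of trajectory-level, outcome, and format rewards.

    The weights here are illustrative placeholders, not the paper's values.
    """
    return w_traj * r_traj + w_outcome * r_outcome + w_format * r_format
```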
Problem

Research questions and friction points this paper is trying to address.

Tool-Integrated Reasoning
trajectory repair
reward modeling
supervision bias
tool-use trajectories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool-Integrated Reasoning
Trajectory Repair
Preference Modeling
Reinforcement Learning
LLM-as-Repairer
Siyu Gong
School of Computer Science and Engineering, Southeast University, Nanjing, China; Key Lab. of Computer Network and Information Integration (Southeast University), MOE, China
Linan Yue
Southeast University
Trustworthy AI · Natural Language Processing
Weibo Gao
University of Science and Technology of China
Data Mining · Educational Big Data · Intelligent Education · Domain Adaptation · LLM Agent
Fangzhou Yao
University of Illinois at Urbana-Champaign
Cloud Computing · Distributed Systems
Shimin Di
School of Computer Science and Engineering, Southeast University, Nanjing, China; Key Lab. of Computer Network and Information Integration (Southeast University), MOE, China
Lei Feng
Professor, Southeast University
Machine Learning · Data Science · Statistics
Min-Ling Zhang
Professor, School of Computer Science and Engineering, Southeast University, China
Artificial Intelligence · Machine Learning · Data Mining