Boosting Tool Use of Large Language Models via Iterative Reinforced Fine-Tuning

📅 2025-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit limited performance on complex, multi-step tool-use tasks, and simply scaling up synthetic training data yields decaying supervised fine-tuning (SFT) gains, largely because the model struggles to learn from samples tied to its own deficiencies. Method: We propose an iterative reinforced fine-tuning framework that couples preference optimization (e.g., DPO) with deficiency-driven data curation. It comprises three core components: (1) automatic identification of deficiency-related samples via policy-model feedback; (2) construction of fine-grained, high signal-to-noise preference pairs using Monte Carlo Tree Search (MCTS); and (3) an easy-to-hard progressive warm-up SFT strategy. Contribution/Results: The framework effectively mitigates the decay in training gains. Experiments demonstrate that, at comparable parameter counts, our method consistently outperforms leading open- and closed-source LLMs on multi-step tool-calling benchmarks, with notable gains in complex scenarios alongside strong generalization and robustness.
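
A minimal sketch of the loop these three components form, in plain Python. Every callable here (warmup_sft, identify_defects, mcts_preference_pairs, dpo_update) is a hypothetical placeholder for a stage of the pipeline, not an API from the paper:

```python
def iterative_reinforced_finetuning(
    policy,
    warmup_sft,             # easy-to-hard staged SFT trainer (component 3)
    identify_defects,       # flags data the current policy handles poorly (component 1)
    mcts_preference_pairs,  # expands defects into (chosen, rejected) pairs (component 2)
    dpo_update,             # one round of preference optimization
    sft_data,
    tool_tasks,
    n_iters=3,
):
    # Warm-up SFT happens once, before the reinforcement iterations.
    policy = warmup_sft(policy, sft_data)
    for _ in range(n_iters):
        # (1) Deficiency identification via policy-model feedback.
        defects = identify_defects(policy, tool_tasks)
        # (2) Fine-grained, high signal-to-noise preference pairs via MCTS.
        pairs = mcts_preference_pairs(policy, defects)
        # Update: align with ground truth, move away from deficiencies.
        policy = dpo_update(policy, pairs)
    return policy
```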

📝 Abstract
Augmenting large language models (LLMs) with external tools is a promising approach to enhancing their capabilities, and effectively realizing this potential on complex tasks hinges on improving their ability to use tools. Synthesizing tool-use data by simulating the real world is an effective approach; nevertheless, our investigation reveals that training gains decay significantly as the scale of such data increases. The primary factor is the model's poor performance (a.k.a. deficiency) in complex scenarios, which hinders learning from the data via SFT. Motivated by this, we propose an iterative reinforced fine-tuning strategy that continually guides the model to alleviate its deficiencies. Specifically, we first identify deficiency-related data based on feedback from the policy model, then perform a Monte Carlo Tree Search to collect fine-grained preference pairs that pinpoint the deficiencies. Subsequently, we update the policy model using preference optimization so that it aligns with the ground truth and moves away from the deficient behavior. This process can be iterated. Moreover, before the iterations, we propose an easy-to-hard warm-up SFT strategy to facilitate learning from challenging data. Experiments demonstrate that our models surpass models of the same parameter scale, outperforming many larger open-source and closed-source models, and achieve notable training gains in complex tool-use scenarios.
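
The abstract's "align with the ground truth and move away from the deficient behavior" step is an instance of preference optimization; the summary above cites DPO as one instantiation. As an illustration only (not the paper's exact objective), the standard DPO loss over a (chosen, rejected) trajectory pair looks like this:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Standard DPO: grow the policy's log-prob margin of the chosen
    # (ground-truth) trajectory over the rejected (deficient) one,
    # measured relative to a frozen reference model.
    margin = ((pi_chosen_logp - ref_chosen_logp)
              - (pi_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(beta * margin).mean()

# Toy check with made-up per-sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss)  # ~0.62; shrinks as the chosen margin grows
```

In this paper's setting, the chosen trajectory would be the ground-truth tool-call sequence and the rejected one a deficient rollout surfaced by the MCTS step.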
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Tool-Use Capability
Decaying Training Gains as Synthetic Data Scales
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative Reinforced Fine-Tuning
Deficiency-Targeted Model Improvement
Easy-to-Hard Warm-Up SFT (see the sketch below)
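
A minimal sketch of the gradual (easy-to-hard) warm-up: sort SFT samples by a difficulty proxy and train in cumulative stages. Both the proxy (number of required tool calls) and the cumulative staging are our assumptions; the paper only specifies that warm-up proceeds from easy to hard:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    target: str
    n_tool_calls: int  # assumed difficulty proxy: more required calls = harder

def easy_to_hard_stages(samples, n_stages=3):
    # Sort by the difficulty proxy, then emit cumulative buckets so each
    # SFT stage revisits easy data while adding progressively harder data.
    ordered = sorted(samples, key=lambda s: s.n_tool_calls)
    step = -(-len(ordered) // n_stages)  # ceiling division
    return [ordered[: (i + 1) * step] for i in range(n_stages)]

# Usage sketch: run one SFT pass per stage, easiest data first.
# for stage in easy_to_hard_stages(train_samples):
#     model = run_sft(model, stage)
```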
👥 Authors
Yirong Zeng
Harbin Institute of Technology SCIR Lab
Xiao Ding
Harbin Institute of Technology SCIR Lab
Yuxian Wang
Huawei Technologies Co., Ltd
Weiwen Liu
Associate Professor, Shanghai Jiao Tong University
large language models, AI agents, recommender systems
Wu Ning
Huawei Technologies Co., Ltd
Yutai Hou
Huawei
LLM, NLP, Dialogue, Alignment, Meta Learning
Xu Huang
Huawei Noah’s Ark Lab
Bing Qin
Professor, Harbin Institute of Technology
Natural Language Processing, Information Extraction, Sentiment Analysis
Ting Liu
Harbin Institute of Technology SCIR Lab