Self-Training Large Language Models for Tool-Use Without Demonstrations

📅 2025-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from factual hallucinations and arithmetic errors, particularly in knowledge-intensive and mathematical reasoning tasks. Existing tool-augmentation approaches rely heavily on manually curated, high-quality demonstration examples, limiting generalizability and scalability. This paper investigates demonstration-free tool-use learning: it uses zero-shot prompting to elicit tool-calling trajectories (e.g., calculator invocations, retrieval API calls) from the LLM itself, then fine-tunes the model on the resulting self-synthesized tool-augmented data, constructed from TriviaQA and GSM8K, comparing supervised fine-tuning (SFT) with direct preference optimization (DPO). The method improves long-tail knowledge QA by 3.7% on PopQA (held out solely for evaluation), validating tool integration without human demonstrations, but yields mixed results on TriviaQA, GSM8K, and NQ-Open, exposing cross-task generalization bottlenecks. The core contribution is an automated, LLM-driven pipeline for synthesizing and learning from tool-use trajectories end to end.
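The self-training step described above (zero-shot prompt the model for a tool-calling trace, keep only traces whose final answer matches the gold label, then fine-tune on the kept traces) can be sketched as follows. All names (`call_model`, `extract_answer`, `synthesize_traces`) and the trace format are hypothetical stand-ins, not the paper's actual implementation:

```python
import re

def call_model(prompt):
    # Stand-in for querying the base LLM with a zero-shot tool-use prompt.
    # A real implementation would call the model and return its generation.
    if "Calculate" in prompt:
        return "Call: calculator(12*7) -> 84. Answer: 84"
    return "Answer: unknown"

def extract_answer(trace):
    # Pull the final answer out of a generated trace.
    m = re.search(r"Answer:\s*(\S+)", trace)
    return m.group(1) if m else None

def synthesize_traces(qa_pairs, prompt_template):
    """Self-training filter: sample a tool-use trace per question and
    keep it only if its final answer matches the gold answer.
    `kept` feeds SFT; (kept, rejected) pairs can form DPO preferences."""
    kept, rejected = [], []
    for question, gold in qa_pairs:
        trace = call_model(prompt_template.format(question=question))
        bucket = kept if extract_answer(trace) == gold else rejected
        bucket.append((question, trace))
    return kept, rejected
```

The answer-match filter is what removes the need for human demonstrations: correctness of the final answer stands in for trace quality.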

📝 Abstract
Large language models (LLMs) remain prone to factual inaccuracies and computational errors, including hallucinations and mistakes in mathematical reasoning. Recent work augmented LLMs with tools to mitigate these shortcomings, but often requires curated gold tool-use demonstrations. In this paper, we investigate whether LLMs can learn to use tools without demonstrations. First, we analyse zero-shot prompting strategies to guide LLMs in tool utilisation. Second, we propose a self-training method to synthesise tool-use traces using the LLM itself. We compare supervised fine-tuning and preference fine-tuning techniques for fine-tuning the model on datasets constructed using existing Question Answering (QA) datasets, i.e., TriviaQA and GSM8K. Experiments show that tool-use enhances performance on a long-tail knowledge task: 3.7% on PopQA, which is used solely for evaluation, but leads to mixed results on other datasets, i.e., TriviaQA, GSM8K, and NQ-Open. Our findings highlight the potential and challenges of integrating external tools into LLMs without demonstrations.
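The abstract contrasts supervised fine-tuning with preference fine-tuning. A minimal sketch of the standard DPO objective (Rafailov et al., 2023) for a single (chosen, rejected) trace pair is shown below, using only sequence log-probabilities; this illustrates the general technique, not necessarily the paper's exact training setup:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * (policy-vs-reference margin on chosen
                         minus the same margin on rejected)).
    Here a correct tool-use trace would be `chosen` and an
    incorrect one `rejected`."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy assigns no more relative probability to the chosen trace than the reference does, the margin is zero and the loss is log 2; pushing probability mass toward correct traces drives the loss below that.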
Problem

Research questions and friction points this paper is trying to address.

Factual hallucinations and arithmetic errors in LLMs
Reliance of existing tool-augmentation methods on curated gold demonstrations
Whether LLMs can learn tool-use without demonstrations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-training for tool-use
Zero-shot prompting strategies
Preference fine-tuning techniques