From Failure to Mastery: Generating Hard Samples for Tool-use Agents

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing agent training data consists predominantly of simple, homogeneous trajectories that fail to capture complex implicit logical dependencies, thereby limiting model performance. This work proposes HardGen, an automated agent pipeline that leverages failure cases to generate high-quality synthetic data. HardGen is the first framework to transform failed trajectories into dynamic, structured API graphs; traces sampled from these graphs serve as conditional priors for instantiating modular high-level tools, which in turn yield verifiable, complex reasoning chains (Chain-of-Thought). A closed-loop evaluation feedback mechanism continuously refines sample quality. A 4B-parameter model trained on HardGen-synthesized data outperforms leading open- and closed-source large language models—including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5—across multiple benchmarks, demonstrating significant improvements in complex tool use and reasoning.

📝 Abstract
The advancement of LLM agents with tool-use capabilities requires diverse and complex training corpora. Existing data generation methods, which predominantly follow a paradigm of random sampling and shallow generation, often yield simple and homogeneous trajectories that fail to capture complex, implicit logical dependencies. To bridge this gap, we introduce HardGen, an automatic agentic pipeline designed to generate hard tool-use training samples with verifiable reasoning. First, HardGen establishes a dynamic API Graph built upon agent failure cases, from which it samples to synthesize hard traces. Second, these traces serve as conditional priors to guide the instantiation of modular, abstract advanced tools, which are subsequently leveraged to formulate hard queries. Finally, the advanced tools and hard queries enable the generation of verifiable complex Chain-of-Thought (CoT), with closed-loop evaluation feedback steering the continuous refinement of the process. Extensive evaluations demonstrate that a 4B-parameter model trained with our curated dataset achieves superior performance compared to several leading open-source and closed-source competitors (e.g., GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5). Our code, models, and dataset will be open-sourced to facilitate future research.
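The first stage of the pipeline, building an API graph from failure cases and sampling hard traces from it, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class name `APIGraph`, the method names, and the example API names (`search_flights`, `check_visa`, etc.) are all assumptions, and the random-walk sampler stands in for whatever graph-sampling strategy HardGen actually uses.

```python
import random
from collections import defaultdict


class APIGraph:
    """Directed graph of API dependencies built from failed trajectories.

    An edge a -> b records that API b was invoked after API a in some
    failure case (illustrative stand-in for HardGen's dynamic API Graph).
    """

    def __init__(self):
        self.edges = defaultdict(set)

    def add_failure_trace(self, trace):
        # Record each consecutive API pair observed in a failed trajectory.
        for src, dst in zip(trace, trace[1:]):
            self.edges[src].add(dst)

    def sample_hard_trace(self, start, max_len=4, seed=0):
        # Seeded random walk over failure-derived edges; the resulting
        # trace is a candidate "hard" tool-use sequence to turn into a query.
        rng = random.Random(seed)
        trace = [start]
        while len(trace) < max_len:
            successors = sorted(self.edges.get(trace[-1], ()))
            if not successors:
                break
            trace.append(rng.choice(successors))
        return trace


# Hypothetical failure trajectories from a travel-booking agent.
graph = APIGraph()
graph.add_failure_trace(["search_flights", "book_flight", "pay_invoice"])
graph.add_failure_trace(["search_flights", "check_visa", "book_flight"])

hard_trace = graph.sample_hard_trace("search_flights")
```

A sampled trace like this would then serve as the conditional prior for instantiating advanced tools and formulating a hard query in the later stages of the pipeline.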
Problem

Research questions and friction points this paper is trying to address.

tool-use agents
training data generation
complex reasoning
hard samples
logical dependencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hard Sample Generation
Tool-use Agents
API Graph
Verifiable Chain-of-Thought
Closed-loop Feedback