🤖 AI Summary
This work addresses the frequent failures of large language models in multi-step tool orchestration tasks, often caused by incorrect API invocation sequences or improper parameter passing. To tackle this, the authors construct a reinforcement learning environment grounded in cached real-world API responses and propose a constrained data synthesis approach to generate multi-step execution trajectories with controllable complexity. They further design a hierarchical reward mechanism that separately evaluates the atomic validity of individual API calls and the logical correctness of the overall orchestration. This framework overcomes the limitations of traditional binary rewards and purely simulated data, achieving significant improvements in episode-level accuracy on the ComplexFuncBench benchmark. Ablation studies confirm the necessity of both reward signals for effective learning.
📝 Abstract
Multi-step tool orchestration, where LLMs must invoke multiple dependent APIs in the correct order while propagating intermediate outputs, remains challenging. State-of-the-art models frequently fail to execute full call sequences correctly, with parameter-value errors accounting for a significant share of failures. Training models to handle such workflows faces two obstacles: existing environments focus on simple per-turn function calls with simulated data, and binary rewards provide no signal for partial correctness.
We present a framework addressing both challenges. First, we construct a reinforcement learning environment backed by a large-scale cache of real API responses, enabling a data synthesis pipeline that samples valid multi-step orchestration traces with controllable complexity and significantly higher generation efficiency than unconstrained methods. Second, we propose a graduated reward design that decomposes correctness into atomic validity (the correctness of individual function calls, scored at increasing granularity) and orchestration (correct tool sequencing that respects inter-call dependencies). On ComplexFuncBench, our approach yields substantial improvements in turn accuracy. Ablation studies confirm that both reward components are essential: using either alone significantly degrades performance.
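To make the graduated reward concrete, here is a minimal sketch of one plausible decomposition into an atomic-validity term and an orchestration term. The `Call` structure, the granularity weights (tool name, parameter keys, parameter values), the positional alignment of predicted and reference calls, and the use of a longest-common-subsequence score as a proxy for dependency-respecting order are all illustrative assumptions, not the authors' implementation.

```python
# Sketch of a graduated reward: atomic validity scores each call at
# increasing granularity; orchestration scores the overall call ordering.
# Weights and data structures are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    name: str      # tool/API name
    params: dict   # parameter name -> value


def atomic_validity(pred: Call, ref: Call) -> float:
    """Score one call at increasing granularity:
    correct tool name (0.4), correct parameter keys (0.3),
    correct parameter values (0.3). Illustrative weights."""
    if pred.name != ref.name:
        return 0.0
    score = 0.4
    if set(pred.params) == set(ref.params):
        score += 0.3
    if ref.params:
        matched = sum(pred.params.get(k) == v for k, v in ref.params.items())
        score += 0.3 * matched / len(ref.params)
    else:
        score += 0.3
    return score


def orchestration(pred: list[Call], ref: list[Call]) -> float:
    """Longest common subsequence over tool names, normalized by the
    reference length. Since the reference trace is ordered consistently
    with its data dependencies, LCS rewards dependency-respecting order."""
    m, n = len(pred), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if pred[i].name == ref[j].name:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / n if n else 1.0


def graduated_reward(pred: list[Call], ref: list[Call],
                     w_atomic: float = 0.5, w_orch: float = 0.5) -> float:
    """Blend per-call validity (positionally aligned, a simplification)
    with sequence-level orchestration; partial credit instead of 0/1."""
    if not ref:
        return 1.0 if not pred else 0.0
    atomic = sum(atomic_validity(p, r) for p, r in zip(pred, ref)) / len(ref)
    return w_atomic * atomic + w_orch * orchestration(pred, ref)
```

Unlike a binary reward, this gives a trajectory with the right tool sequence but one wrong parameter value a score strictly between 0 and 1, so the policy gradient still carries a learning signal for partially correct orchestrations.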