ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) struggle with deep conceptual reasoning and incur prohibitively high inference costs on challenging agentic benchmarks such as Humanity's Last Exam (HLE). Method: This paper proposes ToolOrchestra, a lightweight, model-driven tool orchestration framework trained with multi-objective reinforcement learning that jointly optimizes answer correctness, inference efficiency, and user preference. An 8B-parameter orchestrator model, trained with outcome-, efficiency-, and preference-aware rewards, generalizes robustly to unseen tools. Contribution/Results: The resulting Orchestrator achieves 37.1% accuracy on HLE, surpassing GPT-5 (35.1%) while being 2.5× more efficient, and attains state-of-the-art performance on tau2-Bench and FRAMES at roughly 30% of GPT-5's cost, demonstrating significant gains in both efficacy and efficiency.

📝 Abstract
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.
Problem

Research questions and friction points this paper is trying to address.

Efficiently solving complex tasks using small orchestrators and tools
Training orchestrators with reinforcement learning for user-aligned tool use
Achieving higher accuracy at lower cost than previous tool-use agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Small orchestrators manage models and tools efficiently
Reinforcement learning with multi-aware rewards trains orchestrators
Lightweight orchestration model composes diverse tools effectively
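The multi-aware reward described above (outcome-, efficiency-, and preference-aware terms combined into one training signal) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual formulation: the function name, weights, and normalization scheme are all assumptions made for clarity.

```python
def orchestration_reward(correct, cost, max_cost, preference_match,
                         w_outcome=1.0, w_efficiency=0.3, w_preference=0.2):
    """Combine outcome-, efficiency-, and preference-aware terms into one
    scalar RL reward. All weights are illustrative assumptions, not the
    paper's values.

    correct          -- whether the final answer was judged correct
    cost             -- inference cost spent on the episode (e.g. dollars)
    max_cost         -- cost budget used to normalize the efficiency term
    preference_match -- fraction of tool calls matching the user's stated
                        tool preferences, in [0, 1]
    """
    # Outcome term: binary correctness of the final answer.
    outcome = 1.0 if correct else 0.0
    # Efficiency term: cheaper episodes score higher; cost is clipped
    # to the budget so the term stays in [0, 1].
    efficiency = 1.0 - min(cost, max_cost) / max_cost
    # Weighted sum keeps correctness dominant while still rewarding
    # cheap, preference-aligned tool choices.
    return (w_outcome * outcome
            + w_efficiency * efficiency
            + w_preference * preference_match)
```

Under these assumed weights, a correct, zero-cost, fully preference-aligned episode scores 1.5, while an incorrect episode that exhausts the budget and ignores preferences scores 0.0; the weighting choice reflects the paper's framing that correctness is the primary objective, with efficiency and preference as secondary pressures.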