Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing agent frameworks typically rely on a single language model and fixed skill-invocation logic, so they cannot exploit the complementary strengths of multiple expert models and diverse skills, which limits performance on complex multimodal tasks. This work proposes Maestro, which formulates task execution as a sequential decision-making process over a hierarchical model-skill registry. A lightweight 4B policy dynamically orchestrates frozen expert models and a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is trained with outcome-based reinforcement learning, requires no step-level supervision, and generalizes to unseen models and skills without retraining. Across ten multimodal benchmarks, Maestro achieves an average accuracy of 70.1%, surpassing GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). When the registry is augmented with previously unseen out-of-domain experts, it reaches a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines, and the authors report low latency and high computational efficiency.
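
The orchestration described above can be pictured as a loop in which a policy repeatedly picks either a model-skill pair or a stop action. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `Registry`, `run_episode`, and `toy_policy` are hypothetical, the registry is modeled as a simple model-to-skills mapping (the summary does not say how the two tiers of the skill library are organized), and the experts are stand-in functions rather than real frozen models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical action type: None means "terminate", otherwise (model, skill).
Action = Optional[Tuple[str, str]]
Skill = Callable[[str], str]
Policy = Callable[[str, List[Action]], Action]


@dataclass
class Registry:
    # Assumed layout: expert model name -> skill name -> callable on the context.
    experts: Dict[str, Dict[str, Skill]] = field(default_factory=dict)

    def actions(self) -> List[Action]:
        # Every registered model-skill pair, plus the terminate action.
        pairs: List[Action] = [
            (model, skill)
            for model, skills in self.experts.items()
            for skill in skills
        ]
        return pairs + [None]


def run_episode(policy: Policy, registry: Registry, task: str, max_steps: int = 4):
    """Roll out one episode: the policy chooses an action at each step."""
    context = task
    trajectory: List[Action] = []
    for _ in range(max_steps):
        action = policy(context, registry.actions())
        trajectory.append(action)
        if action is None:  # the policy decided to stop
            break
        model, skill = action
        # The expert is frozen: it is only called, never updated.
        context += "\n" + registry.experts[model][skill](context)
    return context, trajectory


# Toy usage with a stand-in expert and a hand-written policy.
registry = Registry({"chart_expert": {"read_axes": lambda ctx: "axes: x=year, y=sales"}})


def toy_policy(context: str, actions: List[Action]) -> Action:
    # Call the first expert once, then terminate.
    return actions[0] if "axes:" not in context else None


final_context, trajectory = run_episode(toy_policy, registry, "What does the chart show?")
```

In this toy run the policy calls the single expert once and then terminates. In the framework described here, the hand-written `toy_policy` would be replaced by the learned lightweight policy.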

📝 Abstract

The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, thereby limiting their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based RL, requiring no step-level supervision. We evaluate Maestro across ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out-of-domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines. Maestro further maintains high computational efficiency with low latency. The source code is available at https://github.com/jinyangwu/Maestro.
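
To illustrate what "outcome-based RL, requiring no step-level supervision" can mean in practice, the sketch below applies a plain REINFORCE-style update in which every action in an episode shares one advantage computed from the final outcome alone. This is an illustrative stand-in: the abstract does not name the RL algorithm Maestro uses, and the state-independent softmax policy, the function names, and the hyperparameters here are all assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Numerically stable softmax over action logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def outcome_update(
    logits: List[float],
    chosen: List[int],
    reward: float,
    baseline: float = 0.0,
    lr: float = 0.1,
) -> List[float]:
    """One REINFORCE-style step for a state-independent softmax policy.

    `chosen` holds the action indices taken during one episode and `reward`
    is the single end-of-episode outcome (e.g. 1.0 if the final answer was
    correct). No per-step labels are used: each step gets the same advantage.
    """
    advantage = reward - baseline
    probs = softmax(logits)  # the policy that generated the episode
    new_logits = list(logits)
    for action in chosen:
        for i in range(len(new_logits)):
            # Gradient of log softmax(logits)[action] with respect to logit i.
            grad = (1.0 if i == action else 0.0) - probs[i]
            new_logits[i] += lr * advantage * grad
    return new_logits


# Toy usage: an episode that picked action 1 twice and ended with a correct answer.
updated = outcome_update([0.0, 0.0, 0.0], chosen=[1, 1], reward=1.0)
```

After a successful episode the logit of the action that was taken rises relative to the others, while an episode with zero advantage leaves the policy unchanged.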

Problem

Research questions and friction points this paper is trying to address.

large language models

modular skills

model-skill orchestration

multimodal tasks

complementary strengths

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning

Model-Skill Orchestration

Hierarchical Ensembles