🤖 AI Summary
Existing LLM agent optimization methods focus solely on configuration tuning while neglecting graph-structural deficiencies, leading to suboptimal agent designs. Method: We propose the first framework for joint optimization of agent graph topology and node configurations. Our approach introduces a framework-agnostic global optimizer that combines reinforcement learning with gradient-informed heuristic search; it leverages reflective textual feedback from execution trajectories to guide iterative rollouts, enhancing sample efficiency and enabling precise localization of structural failures. Contributions/Results: (1) first end-to-end joint search over both structure and configuration; (2) novel use of interpretable textual feedback as an optimization signal; (3) average improvements of 12%, 4.9%, and 4.86% over MIPROv2, GEPA, and GEPA+Merge, respectively, across IFBench and HotpotQA, achieved with fewer rollouts; additionally, strong generalization is demonstrated on interview and RAG agents.
📝 Abstract
Building reliable LLM agents requires decisions at two levels: the graph (which modules exist and how information flows) and the configuration of each node (models, prompts, tools, control knobs). Most existing optimizers tune configurations while holding the graph fixed, leaving structural failure modes unaddressed. We introduce Maestro, a framework-agnostic holistic optimizer for LLM agents that jointly searches over graphs and configurations to maximize agent quality, subject to explicit rollout/token budgets. Beyond numeric metrics, Maestro leverages reflective textual feedback from traces to prioritize edits, improving sample efficiency and targeting specific failure modes. On the IFBench and HotpotQA benchmarks, Maestro consistently surpasses leading prompt optimizers (MIPROv2, GEPA, and GEPA+Merge) by averages of 12%, 4.9%, and 4.86%, respectively; even when restricted to prompt-only optimization, it still leads by 9.65%, 2.37%, and 2.41%. Maestro achieves these results with far fewer rollouts than GEPA. We further show large gains on two applications (interviewer and RAG agents), highlighting that joint graph-and-configuration search addresses structural failure modes that prompt tuning alone cannot fix.
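To make the two-level search concrete, here is a minimal, hypothetical sketch of a budgeted joint search over a graph and its configuration. All names (`evaluate`, `mutate_graph`, `mutate_config`, `joint_search`) are illustrative assumptions, not Maestro's API: a toy numeric score stands in for benchmark quality, and a greedy accept-if-better loop stands in for Maestro's RL plus gradient-informed heuristic search; the reflective textual-feedback signal is omitted for brevity.

```python
import random

def evaluate(graph, config):
    """Stand-in scorer: rewards up to 3 graph nodes plus the temperature knob.
    In a real system this would be a benchmark metric over agent rollouts."""
    return min(len(graph), 3) + config.get("temperature", 0.0)

def mutate_graph(graph):
    """Structural edit: add or remove a node in the agent graph."""
    g = list(graph)
    if g and random.random() < 0.5:
        g.pop(random.randrange(len(g)))
    else:
        g.append(f"node{len(g)}")
    return g

def mutate_config(config):
    """Configuration edit: perturb one knob (here, sampling temperature)."""
    c = dict(config)
    t = c.get("temperature", 0.5) + random.uniform(-0.2, 0.2)
    c["temperature"] = min(1.0, max(0.0, t))
    return c

def joint_search(budget=200, seed=0):
    """Greedy joint search under an explicit rollout budget: each rollout
    proposes either a structural or a configuration edit and keeps it
    only if the (toy) score improves."""
    random.seed(seed)
    graph, config = ["node0"], {"temperature": 0.5}
    best = evaluate(graph, config)
    for _ in range(budget):
        if random.random() < 0.5:
            cand_g, cand_c = mutate_graph(graph), config
        else:
            cand_g, cand_c = graph, mutate_config(config)
        score = evaluate(cand_g, cand_c)
        if score > best:  # accept only improving edits
            graph, config, best = cand_g, cand_c, score
    return graph, config, best
```

The sketch illustrates why configuration-only tuning is limited: if `mutate_graph` is disabled, the score is capped by the initial one-node structure no matter how well the temperature is tuned, mirroring the structural failure modes the abstract describes.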