🤖 AI Summary
This work addresses the redundancy and high computational overhead of existing query-level workflow generation methods for multi-agent systems, where the trade-off between task-level and query-level approaches remains unclear. To this end, the authors propose SCALE, a framework that leverages a self-predictive optimizer and few-shot calibration to efficiently generate general-purpose workflows at the task level, thereby avoiding per-query workflow generation and full execution for validation. SCALE introduces a novel, low-cost task-level evaluation mechanism based on self-evolution and generative reward modeling, building on the finding that a small set of task-level workflows can effectively cover the majority of query scenarios. Experiments demonstrate that SCALE achieves comparable performance (an average drop of only 0.61% across multiple datasets) while reducing token consumption by up to 83%, substantially improving efficiency and scalability.
📝 Abstract
Multi-Agent Systems (MAS) built on large language models typically solve complex tasks by coordinating multiple agents through workflows. Existing approaches generate workflows either at the task level or the query level, but their relative costs and benefits remain unclear. Through rethinking and empirical analysis, we show that query-level workflow generation is not always necessary, since a small set of the top-K best task-level workflows together already covers an equivalent or even larger set of queries. We further find that exhaustive execution-based task-level evaluation is both extremely token-costly and frequently unreliable. Inspired by the ideas of self-evolution and generative reward modeling, we propose a low-cost task-level generation framework, \textbf{SCALE}, which stands for \underline{\textbf{S}}elf-prediction of the optimizer with few-shot \underline{\textbf{CAL}}ibration for \underline{\textbf{E}}valuation instead of full validation execution. Extensive experiments demonstrate that \textbf{SCALE} maintains competitive performance, with an average degradation of just 0.61\% compared to existing approaches across multiple datasets, while cutting overall token usage by up to 83\%.
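The abstract's claim that a small top-K set of task-level workflows can cover most queries can be pictured as a coverage-maximization problem. The sketch below is purely illustrative and not the paper's algorithm: it assumes we already know, for each candidate workflow, which queries it handles, and greedily picks the K workflows with the largest combined coverage. All names (`top_k_by_coverage`, `wf_a`, etc.) are hypothetical.

```python
# Hypothetical sketch of the top-K coverage idea, not SCALE's actual method.
# Assumes per-workflow query-coverage sets are known in advance.
def top_k_by_coverage(workflow_coverage, k):
    """Greedily pick up to k workflows maximizing the union of covered queries."""
    selected, covered = [], set()
    candidates = dict(workflow_coverage)
    for _ in range(min(k, len(candidates))):
        # Choose the workflow contributing the most not-yet-covered queries.
        best = max(candidates, key=lambda w: len(candidates[w] - covered))
        selected.append(best)
        covered |= candidates.pop(best)
    return selected, covered

# Toy example: three candidate workflows over six query IDs.
coverage = {
    "wf_a": {1, 2, 3, 4},
    "wf_b": {3, 4, 5},
    "wf_c": {5, 6},
}
chosen, queries = top_k_by_coverage(coverage, k=2)
# With k=2, two workflows suffice to cover all six toy queries.
```

In this toy setting, K=2 already covers every query, mirroring the abstract's point that per-query generation can be redundant when a few general-purpose workflows overlap heavily in coverage.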