🤖 AI Summary
This work addresses the limited generalization of current large language models in diverse planning tasks, primarily caused by the scarcity of high-quality interaction data and by gradient conflicts during multi-task training. To overcome these challenges, we propose MagicAgent, a foundational planning model built on a lightweight and scalable synthetic trajectory generation framework. This framework integrates hierarchical task decomposition, tool augmentation, and multi-constraint scheduling to produce synthetic data spanning a broad spectrum of planning scenarios. A two-stage training paradigm—supervised fine-tuning followed by multi-objective reinforcement learning—effectively mitigates inter-task interference and substantially enhances cross-task generalization. Experimental results demonstrate that MagicAgent-32B and MagicAgent-30B-A3B significantly outperform existing open- and closed-source models on benchmarks such as Worfbench and NaturalPlan, achieving a peak accuracy of 86.9%.
📝 Abstract
The evolution of Large Language Models (LLMs) from passive text processors to autonomous agents has established planning as a core component of modern intelligence. However, achieving generalized planning remains elusive, hindered not only by the scarcity of high-quality interaction data but also by inherent conflicts across heterogeneous planning tasks. These challenges yield models that excel at isolated tasks yet struggle to generalize, while existing multi-task training attempts suffer from gradient interference. In this paper, we present \textbf{MagicAgent}, a series of foundation models specifically designed for generalized agent planning. We introduce a lightweight and scalable synthetic data framework that generates high-quality trajectories across diverse planning tasks, including hierarchical task decomposition, tool-augmented planning, multi-constraint scheduling, procedural logic orchestration, and long-horizon tool execution. To mitigate training conflicts, we propose a two-stage training paradigm comprising supervised fine-tuning followed by multi-objective reinforcement learning over both static datasets and dynamic environments. Empirical results demonstrate that MagicAgent-32B and MagicAgent-30B-A3B deliver superior performance, achieving accuracies of $75.1\%$ on Worfbench, $55.9\%$ on NaturalPlan, $57.5\%$ on $\tau^2$-Bench, $86.9\%$ on BFCL-v3, and $81.2\%$ on ACEBench, as well as strong results on our in-house MagicEval benchmarks. These results substantially outperform existing sub-100B models and even surpass leading closed-source models.