🤖 AI Summary
Problem: Existing LLM-based agents show limited robustness and adaptability on complex, multi-step tasks, and existing benchmarks are too coarse-grained to evaluate memory, planning, and tool use systematically.
Method: We propose TDAG, a multi-agent framework based on dynamic Task Decomposition and Agent Generation. It decomposes a complex task into smaller subtasks at run time and generates a dedicated sub-agent for each one. We also introduce ItineraryBench, a travel-planning benchmark of interconnected, progressively complex tasks with a fine-grained evaluation system.
Contribution/Results: TDAG combines dynamic task decomposition with on-demand sub-agent generation, and ItineraryBench assesses agents' memory, planning, and tool use across tasks of varying complexity. In the reported experiments, TDAG significantly outperforms established baselines, showing stronger adaptability and context awareness on complex tasks.
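The decompose-then-generate loop summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: `SubAgent`, `generate_agent`, `solve`, and the `replan` callback are hypothetical names, the LLM call is replaced by a stub, and re-decomposing on failure is only one assumed way in which "dynamic" decomposition might limit error propagation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubAgent:
    """Hypothetical sub-agent generated on demand for exactly one subtask."""
    subtask: str
    prompt: str

    def run(self, context: List[str]) -> str:
        # Stand-in for an LLM call with tools. A subtask whose text
        # contains "FAIL" simulates a step that cannot be completed.
        if "FAIL" in self.subtask:
            raise RuntimeError(f"could not complete: {self.subtask}")
        return f"done: {self.subtask}"


def generate_agent(subtask: str, context: List[str]) -> SubAgent:
    # The prompt is built from the subtask and the results so far, so each
    # agent is specialized to its own step (an assumed design).
    prompt = f"You handle only this subtask: {subtask}. Known results: {context}"
    return SubAgent(subtask=subtask, prompt=prompt)


def solve(
    subtasks: List[str],
    replan: Callable[[str, List[str]], List[str]],
    max_replans: int = 3,
) -> List[str]:
    """Run subtasks in order, replacing a failed subtask with a new decomposition."""
    pending = list(subtasks)
    results: List[str] = []
    replans = 0
    while pending:
        subtask = pending.pop(0)
        agent = generate_agent(subtask, results)
        try:
            results.append(agent.run(results))
        except RuntimeError:
            if replans >= max_replans:
                raise
            replans += 1
            # Dynamic step: swap the failed subtask for replacement subtasks
            # instead of letting the error carry into later steps.
            pending = replan(subtask, results) + pending
    return results


def example_replan(failed: str, done: List[str]) -> List[str]:
    # Hypothetical replanner; a real one would query the main agent.
    return ["book train instead", "confirm seat"]


if __name__ == "__main__":
    print(solve(["find flights FAIL", "book hotel"], example_replan))
```

In this sketch the failed flight search is replaced by two new subtasks, and the hotel booking still runs afterwards with their results in its context.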
📝 Abstract
The emergence of Large Language Models (LLMs) like ChatGPT has inspired the development of LLM-based agents capable of addressing complex, real-world tasks. However, these agents often struggle during task execution due to methodological constraints, such as error propagation and limited adaptability. To address this issue, we propose a multi-agent framework based on dynamic Task Decomposition and Agent Generation (TDAG). This framework dynamically decomposes complex tasks into smaller subtasks and assigns each to a specifically generated subagent, thereby enhancing adaptability in diverse and unpredictable real-world tasks. Simultaneously, existing benchmarks often lack the granularity needed to evaluate incremental progress in complex, multi-step tasks. In response, we introduce ItineraryBench in the context of travel planning, featuring interconnected, progressively complex tasks with a fine-grained evaluation system. ItineraryBench is designed to assess agents' abilities in memory, planning, and tool usage across tasks of varying complexity. Our experimental results reveal that TDAG significantly outperforms established baselines, showcasing its superior adaptability and context awareness in complex task scenarios.
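To illustrate why granularity matters when measuring incremental progress, the sketch below contrasts an all-or-nothing score with a partial-credit score. The checks, field names, and equal weighting are assumptions made for illustration only; they are not the scoring rules actually used by ItineraryBench.

```python
from typing import Any, Callable, Dict, List

Plan = Dict[str, Any]
Check = Callable[[Plan], bool]


def binary_score(plan: Plan, checks: List[Check]) -> float:
    """All-or-nothing: 1.0 only if every check passes."""
    return 1.0 if all(check(plan) for check in checks) else 0.0


def fine_grained_score(plan: Plan, checks: List[Check]) -> float:
    """Partial credit: fraction of checks passed (equal weights assumed)."""
    if not checks:
        return 0.0
    return sum(1 for check in checks if check(plan)) / len(checks)


# Hypothetical constraints for a travel plan.
checks: List[Check] = [
    lambda p: len(p["cities"]) >= 2,     # visits at least two cities
    lambda p: p["cost"] <= p["budget"],  # stays within budget
    lambda p: p["days"] <= 5,            # fits the time limit
]

# A plan that satisfies two of the three constraints (it is over budget).
plan: Plan = {"cities": ["Beijing", "Shanghai"], "cost": 1200, "budget": 1000, "days": 4}

if __name__ == "__main__":
    print(binary_score(plan, checks))
    print(fine_grained_score(plan, checks))
```

Under the binary score this plan is indistinguishable from an empty one, while the partial-credit score records that two of three constraints were met.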