🤖 AI Summary
Multi-agent systems face performance bottlenecks in complex tasks due to coarse-grained evaluation and poor reusability of learned experience.
Method: We propose an evaluation-driven, reusable experience accumulation framework featuring a novel 360° fine-grained multi-dimensional evaluation mechanism, a hierarchical multi-agent architecture, a dual-level experience pool (with structured storage and semantic retrieval), and an LLM-powered self-reflection and experience distillation module. Together these form a closed "evaluation–feedback–evolution" loop.
Contribution/Results: This work pioneers the integration of an organized evaluation paradigm into multi-LLM collaborative systems, significantly enhancing long-term team performance and cross-task generalization. Extensive experiments across multiple complex task benchmarks demonstrate consistent and substantial improvements over state-of-the-art baselines, validating both the effectiveness and scalability of evaluation-driven experience accumulation.
📝 Abstract
Large language model agents have demonstrated remarkable advancements across various complex tasks. Recent works focus on optimizing the agent team or employing self-reflection to iteratively solve complex tasks. Because these agents are all built on the same LLM, self-evaluation alone or removing underperforming agents does not substantively enhance their capability. We argue that comprehensive evaluation, combined with accumulating experience from evaluation feedback, is an effective approach to improving system performance. In this paper, we propose Reusable Experience Accumulation with 360$^\circ$ Assessment (360$^\circ$REA), a hierarchical multi-agent framework inspired by corporate organizational practices. The framework employs a novel 360$^\circ$ performance assessment method for fine-grained, multi-perspective performance evaluation. To enhance the capability of agents in addressing complex tasks, we introduce a dual-level experience pool through which agents accumulate experience from fine-grained assessment. Extensive experiments on complex task datasets demonstrate the effectiveness of 360$^\circ$REA.
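To make the abstract's two central ideas concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that a 360$^\circ$ assessment means collecting reviews of one agent from several perspectives (leader, peers, self) and that the dual-level pool consists of one pool per agent plus one shared team pool. All names (`Review`, `ExperiencePool`, `assess_360`), the 0-10 score scale, and the averaging rule are hypothetical. Retrieval here uses plain word overlap (Jaccard similarity) as a stand-in for the semantic retrieval mentioned in the summary.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Review:
    reviewer: str   # agent that wrote the review (leader, peer, or the target itself)
    target: str     # agent being reviewed
    score: float    # hypothetical 0-10 scale, not taken from the paper
    comment: str    # free-text feedback


@dataclass
class ExperiencePool:
    """Stores (task description, lesson) pairs; used for both pool levels."""
    entries: list = field(default_factory=list)

    def add(self, task: str, lesson: str) -> None:
        self.entries.append((task, lesson))

    def retrieve(self, task: str, k: int = 1) -> list:
        """Return the k lessons whose task text best overlaps the query.

        Word-level Jaccard similarity is an assumption standing in for
        embedding-based semantic retrieval.
        """
        query = set(task.lower().split())

        def overlap(entry):
            words = set(entry[0].lower().split())
            union = query | words
            return len(query & words) / len(union) if union else 0.0

        ranked = sorted(self.entries, key=overlap, reverse=True)
        return [lesson for _, lesson in ranked[:k]]


def assess_360(reviews: list, target: str):
    """Aggregate every review of `target`, whatever the reviewer's role.

    Returns the mean score and the list of comments. Plain averaging is an
    assumed aggregation rule.
    """
    received = [r for r in reviews if r.target == target]
    if not received:
        raise ValueError(f"no reviews for agent {target!r}")
    return mean(r.score for r in received), [r.comment for r in received]
```

In a loop built on this sketch, the comments returned by `assess_360` would be distilled into a lesson and added to the reviewed agent's own pool, while team-wide lessons would go to the shared pool; both pools would then be queried with `retrieve` when a new task arrives.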