🤖 AI Summary
This study addresses the lack of systematic understanding of how multi-agent AI systems evolve and are maintained in real-world development. It presents the first large-scale empirical analysis of eight prominent open-source multi-agent systems, examining over 42,000 code commits and more than 4,700 resolved issues through repository mining, commit categorization, issue tracking, and statistical modeling. The analysis characterizes three distinct development profiles (sustained, steady, and burst-driven), quantifies the distribution of maintenance activities, and identifies three core issue categories: coordination, infrastructure, and defects. The findings show that 40.8% of commits are feature enhancements, 22% of issues are defects, and 10% concern coordination challenges, with median resolution times ranging from under one day to about two weeks, indicating an active yet fragile maintenance ecosystem.
📝 Abstract
The rapid emergence of multi-agent AI systems (MAS), including LangChain, CrewAI, and AutoGen, has shaped how large language model (LLM) applications are developed and orchestrated. However, little is known about how these systems evolve and are maintained in practice. This paper presents the first large-scale empirical study of open-source MAS, analyzing over 42K unique commits and over 4.7K resolved issues across eight leading systems. Our analysis identifies three distinct development profiles: sustained, steady, and burst-driven. These profiles reflect substantial variation in ecosystem maturity. Perfective commits constitute 40.8% of all changes, suggesting that feature enhancement is prioritized over corrective maintenance (27.4%) and adaptive updates (24.3%). Analysis of resolved issues shows that the most frequent concerns are bugs (22%), infrastructure (14%), and agent coordination challenges (10%). Issue reporting also increased sharply across all frameworks starting in 2023. Median resolution times range from under one day to about two weeks, with distributions skewed toward fast responses but a minority of issues requiring extended attention. These results highlight both the momentum and the fragility of the current ecosystem, emphasizing the need for improved testing infrastructure, documentation quality, and maintenance practices to ensure long-term reliability and sustainability.