🤖 AI Summary
This work addresses the unclear contribution of intermediate outputs to final correctness in multi-agent code generation systems (MACGS), which hinders targeted optimization. To this end, we propose CAM, a causality-based analysis framework for MACGS that categorizes intermediate outputs, simulates realistic errors on intermediate features, quantifies each feature's causal effect on system correctness, and aggregates the results into importance rankings. Our analysis identifies context-dependent features, whose importance emerges mainly through interactions with other features, and shows that a hybrid backend architecture, which assigns different backend LLMs according to their relative strengths, improves Pass@1 by up to 7.2%. In two applications of CAM, failure repair that optimizes the top-3 importance-ranked features achieves a 73.3% success rate, and feature pruning reduces intermediate token consumption by up to 66.8% while maintaining generation performance.
📝 Abstract
Despite the remarkable success of Multi-Agent Code Generation Systems (MACGS), the inherent complexity of multi-agent architectures produces substantial volumes of intermediate outputs. To date, the individual importance of these intermediate outputs to system correctness remains opaque, which impedes targeted optimization of MACGS designs. To address this challenge, we propose CAM, the first Causality-based Analysis framework for MACGS, which systematically quantifies the contribution of different intermediate features to system correctness. By comprehensively categorizing intermediate outputs and systematically simulating realistic errors on intermediate features, we identify the features that matter for system correctness and aggregate their importance rankings. We conduct extensive empirical analysis of the resulting importance rankings, which yields two findings. First, we uncover context-dependent features, that is, features whose importance emerges mainly through interactions with other features, which indicates that quality assurance for MACGS should incorporate cross-feature consistency checks. Second, we show that a hybrid-backend MACGS, in which different backend LLMs are assigned according to their relative strengths, achieves up to a 7.2% Pass@1 improvement, underscoring hybrid architectures as a promising direction for future MACGS design. We further demonstrate CAM's practical utility through two applications: (1) failure repair, which achieves a 73.3% success rate by optimizing the top-3 importance-ranked features, and (2) feature pruning, which reduces intermediate token consumption by up to 66.8% while maintaining generation performance. Our work provides actionable insights for MACGS design and deployment, establishing causality analysis as a powerful approach for understanding and improving MACGS.
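To make the idea of intervention-based importance ranking concrete, the following is a minimal sketch and not CAM's actual procedure, which the abstract does not specify. It assumes a toy stand-in for a MACGS run (`toy_system`), hypothetical feature names (`plan`, `interface_spec`, `test_cases`), and a simple effect estimate: the drop in pass rate when a feature is corrupted. The pairwise interaction term illustrates how a context-dependent feature can show no effect on its own yet matter in combination with another feature.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Iterable, List, Sequence, Tuple

# Hypothetical feature names; the source does not list the actual features.
FEATURES = ("plan", "interface_spec", "test_cases")

System = Callable[[int, FrozenSet[str]], bool]


def toy_system(task_id: int, corrupted: FrozenSet[str]) -> bool:
    """Stand-in for a full MACGS run: True if the final code passes.

    Toy rule (an assumption for illustration only): a corrupted plan always
    breaks the run, while interface_spec and test_cases only break it when
    both are corrupted together, mimicking a context-dependent pair.
    """
    if "plan" in corrupted:
        return False
    if {"interface_spec", "test_cases"} <= corrupted:
        return False
    return True


def pass_rate(system: System, tasks: Sequence[int], corrupted: Iterable[str]) -> float:
    """Fraction of tasks that pass when the given features are corrupted."""
    results = [system(t, frozenset(corrupted)) for t in tasks]
    return sum(results) / len(results)


def intervention_effect(system: System, tasks: Sequence[int], corrupted: Iterable[str]) -> float:
    """Drop in pass rate caused by corrupting the given features."""
    corrupted = tuple(corrupted)
    return pass_rate(system, tasks, ()) - pass_rate(system, tasks, corrupted)


def rank_features(system: System, tasks: Sequence[int], features: Sequence[str]) -> List[Tuple[str, float]]:
    """Rank features by their single-feature intervention effect, largest first."""
    effects = [(f, intervention_effect(system, tasks, (f,))) for f in features]
    return sorted(effects, key=lambda fe: fe[1], reverse=True)


def interaction_effects(system: System, tasks: Sequence[int], features: Sequence[str]) -> Dict[Tuple[str, str], float]:
    """Joint effect of each feature pair minus the sum of its single effects."""
    out = {}
    for a, b in combinations(features, 2):
        joint = intervention_effect(system, tasks, (a, b))
        singles = (intervention_effect(system, tasks, (a,))
                   + intervention_effect(system, tasks, (b,)))
        out[(a, b)] = joint - singles
    return out


tasks = list(range(10))
ranking = rank_features(toy_system, tasks, FEATURES)
interactions = interaction_effects(toy_system, tasks, FEATURES)
print(ranking)
print(interactions)
```

In this toy setting, `plan` ranks first on single-feature effect, while `interface_spec` and `test_cases` show no individual effect but a large positive interaction term. A cross-feature consistency check is aimed at this kind of pattern. A real analysis would replace `toy_system` with actual system runs and would likely need a more careful causal estimator than a raw pass-rate difference.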