Testing and Enhancing Multi-Agent Systems for Robust Code Generation

📅 2025-10-12
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work systematically exposes critical robustness deficiencies in multi-agent systems (MASs) for code generation, primarily stemming from semantic inconsistency between planning and coding agents: this miscommunication yields failure rates of 7.9%-83.3% under semantics-preserving perturbations. To address it, we propose the first fuzz-testing evaluation framework tailored to MAS-based code generation, featuring semantics-preserving mutation operators and a multi-stage information-flow analysis. We further introduce a monitor agent and a multi-prompt generation mechanism to dynamically calibrate cross-agent semantic consistency. Extensive experiments across multiple benchmarks and mainstream large language models demonstrate that our repair strategy recovers 40.0%-88.9% of previously failed cases, substantially enhancing system robustness and laying groundwork for trustworthy multi-agent code generation.

πŸ“ Abstract
Multi-agent systems (MASs) have emerged as a promising paradigm for automated code generation, demonstrating impressive performance on established benchmarks by decomposing complex coding tasks across specialized agents with distinct roles. Despite their rapid development and adoption, their robustness remains largely under-explored, raising critical concerns for real-world deployment. This paper presents the first comprehensive study of the robustness of MASs for code generation through a fuzzing-based testing approach. By designing a fuzzing pipeline that incorporates semantics-preserving mutation operators and a novel fitness function, we assess mainstream MASs across multiple datasets and LLMs. Our findings reveal substantial robustness flaws in various popular MASs: after the semantics-preserving mutations are applied, they fail to solve 7.9%-83.3% of the problems they initially solved successfully. Through comprehensive failure analysis, we identify a common yet largely overlooked cause of these robustness issues: miscommunication between planning and coding agents, where plans lack sufficient detail and coding agents misinterpret intricate logic, consistent with the challenges inherent in a multi-stage information transformation process. Accordingly, we propose a repairing method that combines multi-prompt generation with a new monitor agent to address this issue. Evaluation shows that our repairing method effectively enhances the robustness of MASs, solving 40.0%-88.9% of the identified failures. Our work uncovers critical robustness flaws in MASs and provides effective mitigation strategies, contributing essential insights toward more reliable MASs for code generation.
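To illustrate the kind of semantics-preserving mutation the fuzzing pipeline applies, the sketch below rewrites a task description with synonym substitution so its form changes but its meaning does not. The synonym table and function name are illustrative stand-ins, not the paper's actual operators:

```python
import random

# Hypothetical synonym table; the paper's real mutation operators
# are more varied, this is only a minimal stand-in.
SYNONYMS = {
    "compute": ["calculate", "determine"],
    "return": ["output", "yield"],
    "list": ["sequence", "array"],
}

def mutate_prompt(prompt: str, seed: int = 0) -> str:
    """Apply a simple semantics-preserving rewrite to a task description."""
    rng = random.Random(seed)
    words = []
    for word in prompt.split():
        key = word.lower().rstrip(".,")
        if key in SYNONYMS:
            # Preserve trailing punctuation while swapping the word.
            suffix = word[len(key):]
            words.append(rng.choice(SYNONYMS[key]) + suffix)
        else:
            words.append(word)
    return " ".join(words)
```

A robustness test then checks whether a MAS that solved `prompt` still solves `mutate_prompt(prompt)`; a regression on the mutated version signals the kind of fragility the paper measures.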
Problem

Research questions and friction points this paper is trying to address.

Assessing robustness flaws in multi-agent systems for code generation
Identifying miscommunication between planning and coding agents as key issue
Proposing repair method to enhance system robustness through monitoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuzzing pipeline tests multi-agent code generation robustness
Monitor agent introduced to fix inter-agent communication failures
Multi-prompt generation method repairs semantic misinterpretation issues
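The repair loop the bullets above describe can be sketched roughly as follows: after the coding agent produces code, a monitor agent checks it against the plan and, on a mismatch, triggers regeneration with a clarified plan (the multi-prompt step). All agent callables and the feedback protocol here are hypothetical stand-ins, not the paper's implementation:

```python
def run_with_monitor(task, plan_agent, code_agent, monitor_agent,
                     max_retries=3):
    """Generate code under a monitor agent that enforces plan-code consistency."""
    plan = plan_agent(task)
    code = None
    for _ in range(max_retries):
        code = code_agent(task, plan)
        # Monitor returns "consistent" or feedback describing the mismatch.
        verdict = monitor_agent(plan, code)
        if verdict == "consistent":
            return code
        # Multi-prompt generation: reissue the plan with the monitor's
        # feedback folded in, so the coding agent sees clarified intent.
        plan = plan_agent(task + "\nClarify: " + verdict)
    return code  # best effort after exhausting retries
```

The monitor breaks the one-shot plan-to-code handoff: instead of trusting a single plan rendering, the system iterates until the two agents' interpretations agree.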
Zongyi Lyu
The Hong Kong University of Science and Technology, China
Songqiang Chen
The Hong Kong University of Science and Technology, China
Zhenlan Ji
The Hong Kong University of Science and Technology
Software Engineering
Liwen Wang
The Hong Kong University of Science and Technology, China
Shuai Wang
The Hong Kong University of Science and Technology, China
Daoyuan Wu
Lingnan University, Hong Kong. Past Affiliation: HKUST; NTU; CUHK; SMU; PolyU
Large Language Model · AI Security · Blockchain Security · Mobile Security · Software Security
Wenxuan Wang
Renmin University of China, China
Shing-Chi Cheung
Chair Professor of Computer Science and Engineering, HKUST
Software Engineering · Software Testing · Program Testing · Program Analysis · Automated Debugging