🤖 AI Summary
Although multi-agent systems (MAS) and large language models (LLMs) are increasingly prevalent throughout the software development lifecycle (SDLC), their impact on the fairness of developer-facing tools remains underexplored. This study presents a rapid review of fairness in MAS from an SDLC perspective, analyzing 18 studies retained from an initial set of 350 papers. It characterizes how the reviewed work defines fairness, how it evaluates fairness (accuracy on bias benchmarks, demographic disparity measures, and MAS-specific notions such as conformity and bias amplification), and which harms it reports. The analysis reveals three persistent gaps: fragmented evaluation practices that are rarely MAS-specific, limited generalization, and scarce, weakly evaluated mitigation and governance mechanisms. It also identifies four categories of reported harms (representational, quality-of-service, security and privacy, and governance failures) and concludes that current research is not yet ready to support deployable, fairness-assured software systems.
📝 Abstract
Transformer-based large language models (LLMs) and multi-agent systems (MAS) are increasingly embedded across the software development lifecycle (SDLC), yet their fairness implications for developer-facing tools remain underexplored despite their growing role in shaping what code is written, reviewed, and released. We present a rapid review of recent work on fairness in MAS, emphasizing LLM-enabled settings and relevance to software engineering. Starting from an initial set of 350 papers, we screened and filtered the corpus for relevance, retaining 18 studies for final analysis. Across these 18 studies, fairness is framed as a combination of trustworthy AI principles, bias reduction across groups, and interactional dynamics in collectives, while evaluation spans accuracy metrics on bias benchmarks, demographic disparity measures, and emergent MAS-specific notions such as conformity and bias amplification. Reported harms include representational, quality-of-service, security and privacy, and governance failures, which we relate to SDLC stages where evidence is most and least developed. We identify three persistent gaps: (1) fragmented, rarely MAS-specific evaluation practices that limit comparability, (2) limited generalization due to simplified environments and narrow attribute coverage, and (3) scarce, weakly evaluated mitigation and governance mechanisms aligned to real software workflows. These findings suggest MAS fairness research is not yet ready to support deployable, fairness-assured software systems, motivating MAS-aware benchmarks, consistent protocols, and lifecycle-spanning governance.