🤖 AI Summary
This study challenges the conventional view of fairness as a static property of a single language model, a view that is inadequate for ethical conflicts arising in multi-agent interaction. The authors propose that fairness emerges as a procedural property of deliberation between agents. To investigate this, they design a structured three-round, two-agent debate framework set in a hospital triage scenario, in which one agent is aligned to an ethical framework via retrieval-augmented generation (RAG) and the other is either unaligned or adversarially prompted. Their findings indicate that neither agent's allocation is ethically adequate in isolation, whereas the joint allocation reached through negotiation can satisfy fairness criteria that neither agent would have reached alone. Aligned agents partially restore access for marginalized groups, yet they also exhibit intrinsic biases toward certain frameworks. The work connects these limits to Arrow's Impossibility Theorem and argues that bias is moderated through contestation rather than override.
📝 Abstract
Fairness in language models is typically studied as a property of a single, centrally optimized model. As large language models become increasingly agentic, we propose that fairness emerges through interaction and exchange. We study this via a controlled hospital triage framework in which two agents negotiate over three structured debate rounds. One agent is aligned to a specific ethical framework via retrieval-augmented generation (RAG), while the other is either unaligned or adversarially prompted to favor demographic groups over clinical need. We find that alignment systematically shapes negotiation strategies and allocation patterns, and that neither agent's allocation is ethically adequate in isolation, yet their joint final allocation can satisfy fairness criteria that neither would have reached alone. Aligned agents partially moderate bias through contestation rather than override, acting as corrective patches that restore access for marginalized groups without fully converting a biased counterpart. We further observe that even explicitly aligned agents exhibit intrinsic biases toward certain frameworks, consistent with known left-leaning tendencies in LLMs. We connect these limits to Arrow's Impossibility Theorem: no aggregation mechanism can simultaneously satisfy all desiderata of collective rationality, and multi-agent deliberation navigates rather than resolves this constraint. Our results reposition fairness as an emergent, procedural property of decentralized agent interaction, and the system rather than the individual agent as the appropriate unit of evaluation.
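The aggregation constraint mentioned above can be made concrete with a Condorcet cycle, a classic failure of the kind that Arrow's theorem generalizes. The sketch below is illustrative only: the three "frameworks", the patient labels, and the rankings are hypothetical and not taken from the paper. It shows that pairwise majority voting over three individually consistent rankings can yield a collective preference that cycles.

```python
from itertools import combinations

# Hypothetical priority rankings of three patients (A, B, C) under three
# ethical frameworks. Names and orderings are assumptions for illustration.
rankings = {
    "framework_1": ["A", "B", "C"],
    "framework_2": ["B", "C", "A"],
    "framework_3": ["C", "A", "B"],
}


def majority_prefers(x, y, rankings):
    """Return True if a strict majority of rankings place x above y."""
    votes = sum(r.index(x) < r.index(y) for r in rankings.values())
    return votes > len(rankings) / 2


def pairwise_majorities(rankings):
    """List (winner, loser) for every pair that majority vote decides."""
    options = sorted(next(iter(rankings.values())))
    decided = []
    for x, y in combinations(options, 2):
        if majority_prefers(x, y, rankings):
            decided.append((x, y))
        elif majority_prefers(y, x, rankings):
            decided.append((y, x))
    return decided


majorities = pairwise_majorities(rankings)
print(majorities)  # [('A', 'B'), ('C', 'A'), ('B', 'C')]
```

Each ranking is transitive on its own, yet the majority outcome is A over B, B over C, and C over A, so no single collective ordering respects all three pairwise results. Any aggregation rule has to give up at least one desirable property to break such a cycle, which is the sense in which deliberation navigates the constraint rather than resolving it.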