🤖 AI Summary
This study addresses the core challenge of evaluating, interpreting, and regulating ethical behavior in multi-agent systems of large language models (MALMs). Methodologically, it introduces a mechanistic-interpretability-driven paradigm for ethical alignment that integrates neuron-level interpretability analysis with parameter-efficient fine-tuning to identify and intervene in ethics-relevant behavioral pathways, spanning individual decision-making, inter-agent negotiation, and system-level emergent phenomena, without compromising task performance. Key contributions include: (1) a three-tiered ethical evaluation framework covering the individual, interactional, and systemic levels; (2) mechanistic insights into how ethics-related emergent behaviors arise in MALMs; and (3) an “interpretability–alignment” co-governance research agenda. The work provides theoretical and practical foundations for developing trustworthy, collaborative autonomous agent systems.
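The probe-then-intervene pipeline the summary describes can be sketched roughly as follows. Everything here (the toy model, the labeled prompt batches, the difference-of-means probe, the clamping intervention) is an illustrative assumption for a generic neuron-level workflow, not the paper's actual implementation:

```python
# Hypothetical sketch: locate "ethics-relevant" neurons by contrasting
# activations on labeled inputs, then steer them with a forward hook.
# The model, layer choice, and data are illustrative placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for one transformer MLP block whose hidden units we probe.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
hidden = {}

def record(_module, _inputs, output):
    hidden["act"] = output.detach()

model[1].register_forward_hook(record)

# Toy "ethical" vs. "unethical" batches (in practice: encoded prompts).
x_eth, x_une = torch.randn(32, 16), torch.randn(32, 16) + 0.5

model(x_eth); act_eth = hidden["act"].mean(0)
model(x_une); act_une = hidden["act"].mean(0)

# Neurons with the largest mean activation gap are candidate
# "ethics-relevant" units (a crude difference-of-means probe).
gap = (act_une - act_eth).abs()
top = gap.topk(4).indices
print("candidate neurons:", top.tolist())

# Intervention: clamp those neurons toward their "ethical" mean.
def steer(_module, _inputs, output):
    out = output.clone()
    out[:, top] = act_eth[top]
    return out

model[1].register_forward_hook(steer)
out = model(x_une)  # forward pass now runs on the steered activations
```

In a real MALM setting the probe would run over agent dialogue transcripts and the intervention would be validated against task performance, per the paper's stated goal.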
📝 Abstract
Large language models (LLMs) have been widely deployed in various applications, often functioning as autonomous agents that interact with each other in multi-agent systems. While these systems have shown promise in enhancing capabilities and enabling complex tasks, they also pose significant ethical challenges. This position paper outlines a research agenda aimed at ensuring the ethical behavior of multi-agent systems of LLMs (MALMs) from the perspective of mechanistic interpretability. We identify three key research challenges: (i) developing comprehensive evaluation frameworks to assess ethical behavior at individual, interactional, and systemic levels; (ii) elucidating the internal mechanisms that give rise to emergent behaviors through mechanistic interpretability; and (iii) implementing targeted parameter-efficient alignment techniques to steer MALMs towards ethical behaviors without compromising their performance.
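As a concrete illustration of challenge (iii), the sketch below shows a LoRA-style low-rank adapter around a single frozen linear layer, a standard parameter-efficient recipe. The abstract does not specify the paper's exact alignment technique, so the adapter design, rank, and hyperparameters here are assumptions:

```python
# Hypothetical sketch: parameter-efficient alignment via a LoRA-style
# adapter, so steering updates touch only a small parameter subset
# while the pretrained weights stay frozen. Sizes are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.scale = alpha / rank
        # Trainable low-rank factors: effective weight is W + scale * B @ A.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: wrap a layer and train only the adapter on alignment data.
layer = LoRALinear(nn.Linear(64, 64))
opt = torch.optim.AdamW(
    [p for p in layer.parameters() if p.requires_grad], lr=1e-3
)
x, target = torch.randn(8, 64), torch.randn(8, 64)
loss = nn.functional.mse_loss(layer(x), target)
loss.backward(); opt.step()

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # adapter only; base stays frozen
```

Restricting updates to low-rank adapters is one plausible way to steer behavior while limiting the risk of degrading overall task performance, which is the trade-off the paper's third challenge targets.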