🤖 AI Summary
Large language model (LLM)-based multi-agent systems (MAS) suffer from reliability deficits under instruction conflicts (e.g., system-user or peer-peer contradictions) due to misaligned hierarchical compliance: agents erroneously prioritize system-level constraints over user intent, and macroscopic metrics (e.g., pass@k) fail to expose such fine-grained violations.
Method: We propose a three-stage full-stack framework: (i) *Diagnosis*, via the Contextualized Role Adherence Score (CRAS), a query-wise, context-aware metric; (ii) *Localization*, identifying attention drift concentrated in intermediate transformer layers; and (iii) *Alignment*, via SAIL (Surgical Alignment of Instruction Layers), a lightweight method that fine-tunes only the localized critical layers. SAIL combines LoRA-based low-rank adaptation with a token-weighted DPO objective for instruction-level alignment without full-model finetuning.
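The token-weighted DPO objective can be illustrated with a minimal sketch. This is an assumption about the general form (standard DPO with per-token credit weights on the log-probability ratios), not the paper's exact loss; the function name and the interpretation of the weights as normalized attention mass on the focal layers are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def token_weighted_dpo_loss(chosen_logratios, rejected_logratios,
                            chosen_weights, rejected_weights, beta=0.1):
    """Sketch of a token-weighted DPO-style loss.

    *_logratios: per-token log(pi_theta / pi_ref) for the chosen and
    rejected responses. *_weights: per-token credit (e.g., attentional
    contribution on the focal layers); uniform weights of 1.0 recover
    the standard sequence-level DPO margin.
    """
    chosen = sum(w * r for w, r in zip(chosen_weights, chosen_logratios))
    rejected = sum(w * r for w, r in zip(rejected_weights, rejected_logratios))
    # Bradley-Terry preference loss on the weighted margin
    return -math.log(sigmoid(beta * (chosen - rejected)))
```

Down-weighting tokens that contribute little focal attention shrinks their influence on the preference margin, concentrating the gradient on instruction-relevant tokens.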
Contribution/Results: Evaluated with the AutoGen framework on MedQA, SAIL improves the instruction-following rate by 5.60%, significantly mitigating erroneous system-rule prioritization while preserving efficiency and scalability.
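The "surgical" part of SAIL (adapting only the localized layers) rests on the standard LoRA update, where a frozen weight W is augmented by a low-rank product B·A. A minimal pure-Python sketch of one adapted linear layer, with toy shapes and no framework, under the assumption that SAIL uses the conventional LoRA formulation:

```python
class LoRALinear:
    """Frozen base weight W plus a low-rank update: y = (W + alpha*B@A) x.

    SAIL would install adapters like this only on the focal layers found
    by the localization step, leaving all other layers untouched. Shapes
    and the class name here are illustrative, not from the paper.
    """
    def __init__(self, W, A, B, alpha=1.0):
        self.W = W          # d_out x d_in, frozen
        self.A = A          # r x d_in, trainable
        self.B = B          # d_out x r, trainable
        self.alpha = alpha  # scaling factor

    def __call__(self, x):
        base = [sum(wij * xj for wij, xj in zip(row, x)) for row in self.W]
        ax = [sum(aij * xj for aij, xj in zip(row, x)) for row in self.A]
        delta = [sum(bij * aj for bij, aj in zip(row, ax)) for row in self.B]
        return [b + self.alpha * d for b, d in zip(base, delta)]
```

Because only A and B (rank r, typically far smaller than d_in or d_out) are trained, and only on a handful of layers, the adaptation stays lightweight relative to full-model finetuning.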
📝 Abstract
Large Language Model (LLM)-powered multi-agent systems (MAS) have rapidly advanced collaborative reasoning, tool use, and role-specialized coordination in complex tasks. However, reliability-critical deployment remains hindered by a systemic failure mode: hierarchical compliance under instruction conflicts (system-user, peer-peer), where agents misprioritize system-level rules in the presence of competing demands. Moreover, widely used macro-level metrics (e.g., pass@k) obscure these micro-level violations and offer little actionable guidance for remedy. In this work, we present a full-stack, three-stage framework: (1) Diagnose - Contextualized Role Adherence Score (CRAS), a query-wise, context-aware scoring metric that decomposes role adherence into four measurable dimensions; (2) Localize - attention drift analysis revealing that instruction conflicts are resolved by attention heads that are largely concentrated in middle layers; (3) Align - Surgical Alignment of Instruction Layers (SAIL), which installs LoRA only on the localized focal layers and optimizes a token-weighted DPO-style preference objective that credits tokens by their focal attentional contribution. Across standard benchmarks and MAS frameworks, our surgical approach improves instruction hierarchy compliance (e.g., +5.60% with AutoGen on MedQA) without full-model finetuning.
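The Localize stage described above can be sketched as a selection over per-layer attention statistics. How the per-layer "drift" score is computed is not specified here and is an assumption; this sketch only shows the selection step, under the paper's finding that the highest-scoring layers cluster in the middle of the stack.

```python
def locate_focal_layers(conflict_attention_mass, top_k=3):
    """Given a per-layer score (e.g., average attention mass that heads
    assign to conflicting instruction tokens), return the indices of the
    top_k 'focal' layers, sorted by depth. The scoring convention is a
    hypothetical stand-in for the paper's attention-drift analysis.
    """
    ranked = sorted(range(len(conflict_attention_mass)),
                    key=lambda i: conflict_attention_mass[i],
                    reverse=True)
    return sorted(ranked[:top_k])
```

On a toy 6-layer profile peaking mid-stack, the selection returns the middle layers, which are then the only layers where SAIL installs LoRA adapters.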