Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

career value

202K/year

🤖 AI Summary

This study investigates the high susceptibility of large language models (LLMs) to conforming to incorrect group consensus in multi-agent settings, a phenomenon termed high "conformity rate." Through experiments across four model families and activation patching analyses, the authors demonstrate that this behavior does not stem from reinforcement learning from human feedback (RLHF)-induced sycophancy but rather from group pressure suppressing the models’ intrinsic reasoning capabilities—even pre-trained base models exhibit equal or greater conformity. The issue is localized to attention mechanisms in intermediate layers. The work introduces a structured dissent mechanism, showing that a single correct dissenter reduces conformity by 54–73 percentage points. Patching critical layers recovers 96% of the lost accuracy, and a 47.5-percentage-point conformity gap under majority consensus persists robustly across varying jury sizes.

📝 Abstract

LLM-based multi-agent pipelines flip from correct to incorrect answers under simulated peer disagreement at rates we term yield, a vulnerability widely attributed to RLHF-induced sycophancy. We test this attribution across four model families and find it largely wrong: pretrained base models exhibit the same substitution pattern as their Instruct variants, averaging higher yield than Instruct. Using activation patching, we localize the corruption to a narrow mid-layer window where attention carries the causal weight and MLP contribution is negligible; patching above this window restores 96% of the clean-to-pressured P(correct) gap. The attack surface decomposes into two independent factors (channel framing and consensus strength) whose interaction produces a 47.5 percentage-point yield gap at majority consensus, preserved across jury sizes $N \in \{4, 5, 6\}$. Two converging activation-space interventions show that pressure suppresses clean-reasoning features rather than activating a new sycophancy circuit. A single correctly-arguing dissenter reduces yield by 54-73 percentage points across all framings tested, whereas the strongest prompt-level defense fails on attack variants outside its design surface. Mitigations should target the mechanism, structured dissent at the pipeline level, rather than prompt-level defenses.

Problem

Research questions and friction points this paper is trying to address.

multi-agent sycophancy

RLHF

yield

alignment

LLM

Innovation

Methods, ideas, or system contributions that make the work stand out.

activation patching

multi-agent sycophancy

alignment vulnerability