A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy

📅 2026-01-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses behavioral alignment in large language models, where full-model fine-tuning often induces distributional shift, lacks interpretability, and struggles to precisely correct undesirable behaviors such as sycophancy. To overcome these limitations, the authors propose a precise neuron-level intervention that uses sparse autoencoders and linear probes to identify the 3% of MLP neurons most predictive of the target behavior. By decoding these neurons into residual space and applying gradient masking to fine-tune only this sparse subset, the method achieves state-of-the-art or competitive performance on four benchmarks (Syco-Bench, NLP, POLI, PHIL) with Gemma-2-2B and Gemma-2-9B. The approach substantially reduces data requirements while improving alignment efficiency, interpretability, and scalability.
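The summary describes selecting the 3% of MLP neurons most predictive of the target behavior via a linear probe. The paper's exact probe architecture and selection criterion are not given here; the sketch below assumes a simple least-squares probe on neuron activations, ranking neurons by absolute probe weight (all names, sizes, and the toy data are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: activations of 200 MLP neurons on 500 examples, with a
# binary label marking whether the response was sycophantic.
# (Sizes and the "6 causal neurons" construction are illustrative only.)
n_examples, n_neurons = 500, 200
acts = rng.normal(size=(n_examples, n_neurons))
labels = (acts[:, :6].sum(axis=1) > 0).astype(float)

# Fit a linear probe (ordinary least squares) from activations to the
# centered label; each weight scores one neuron's predictiveness.
w, *_ = np.linalg.lstsq(acts, labels - labels.mean(), rcond=None)

# Keep the top 3% of neurons by absolute probe weight.
k = max(1, int(0.03 * n_neurons))
top = np.argsort(-np.abs(w))[:k]
print(sorted(top.tolist()))
```

In this toy construction the probe should recover the neurons that actually drive the label; in the paper's setting the activations would come from a real forward pass and the ranking could equally be derived from SAE feature attributions.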

📝 Abstract
Behavioral alignment in large language models (LLMs) is often achieved through broad fine-tuning, which can result in undesired side effects like distributional shift and low interpretability. We propose a method for alignment that identifies and updates only the neurons most responsible for a given behavior, a targeted approach that allows for fine-tuning with significantly less data. Using sparse autoencoders (SAEs) and linear probes, we isolate the 3% of MLP neurons most predictive of a target behavior, decode them into residual space, and fine-tune only those neurons using gradient masking. We demonstrate this approach on the task of reducing sycophantic behavior, where our method matches or exceeds state-of-the-art performance on four benchmarks (Syco-Bench, NLP, POLI, PHIL) using Gemma-2-2B and 9B models. Our results show that sparse, neuron-level updates offer a scalable and precise alternative to full-model fine-tuning, remaining effective even when little data is available.

Problem

Research questions and friction points this paper is trying to address.

behavioral alignment
sycophancy
large language models
fine-tuning side effects
neuron-level intervention
Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse autoencoders
neuron-level fine-tuning
behavioral alignment
sycophancy correction
gradient masking
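The "gradient masking" contribution listed above means zeroing the gradient for every neuron outside the selected sparse subset before the optimizer step, so only the flagged neurons are updated. A minimal numpy sketch of one masked SGD step (the weight shapes, selected indices, and learning rate are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy MLP weight matrix: 8 neurons (rows) x 4 inputs (columns).
W = rng.normal(size=(8, 4))
W_before = W.copy()

# Suppose the probe flagged neurons 2 and 5 as behavior-relevant
# (indices are illustrative).
selected = np.zeros(8, dtype=bool)
selected[[2, 5]] = True

# Gradient masking: zero the gradient of every unselected neuron
# before the update, so only the sparse subset changes.
grad = rng.normal(size=W.shape)   # stand-in for a real backprop gradient
grad *= selected[:, None]         # row-wise mask
W -= 0.1 * grad                   # SGD step on the masked gradient

changed = np.any(W != W_before, axis=1)
```

In a deep-learning framework the same effect is typically achieved by multiplying `param.grad` by the mask between `loss.backward()` and `optimizer.step()`, or by registering a backward hook on the parameter; only the flagged rows receive updates either way.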