Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how behaviors organized by reinforcement learning from human feedback (RLHF), such as safety refusal, are implemented in aligned large language models, focusing on the role of feedforward network (FFN) neurons. The authors propose a backpropagation-free perturbation probing framework that needs only two forward passes per prompt, followed by a one-time intervention sweep of about 150 passes. It combines residual-stream direction injection with analysis of the FFN-to-skip signal ratio to generate causal hypotheses and identify critical neurons. Across eight behavioral circuits, 13 models, and four architecture families, they identify two circuit structures, opposition circuits and routing circuits, and demonstrate precise behavioral editing with a very small subset of neurons. In safety refusal, ablating about 50 neurons (0.014% of all neurons) changes the response format on 80% of 520 AdvBench prompts while harmful compliance stays near zero (3 of 520 cases, about 0.6%, all with disclaimers). In Qwen3.5-2B, ablating 20 neurons eliminates multi-turn sycophantic capitulation, while amplifying 10 related neurons raises factual correction from 52% to 88% on 200 TruthfulQA prompts.

📝 Abstract

Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amortized across all identified neurons. Across eight behavioral circuits, 13 models, and four architecture families, we identify two circuit structures that organize LLM behavior. Opposition circuits appear when RLHF suppresses a pre-training tendency. In safety refusal, about 50 neurons, or 0.014 percent of all neurons, control the refusal template; ablating them changes 80 percent of response formats on 520 AdvBench prompts while producing near-zero harmful compliance, 3 of 520 cases, all with disclaimers. Routing circuits appear for pre-training behaviors distributed through attention. For language selection, residual-stream direction injection switches English to Chinese output on 99.1 percent of 580 benchmark prompts in the 3 of 19 tested models that satisfy three observed conditions: bilingual training, FFN-to-skip signal ratio between 0.3 and 1.1, and linear representability. The same intervention fails on the other 16 models and on math, code, and factual circuits, defining the limits of directional steering. The FFN-to-skip signal ratio, computed from the same two forward passes, distinguishes the two structures and predicts the appropriate intervention. Circuit topology varies by architecture, from Qwen's concentrated FFN bottleneck to Gemma's normalization-shielded circuit. In Qwen3.5-2B, ablating 20 neurons eliminates multi-turn sycophantic capitulation, while amplifying 10 related neurons improves factual correction from 52 percent to 88 percent on 200 TruthfulQA prompts. These results show that perturbation probing offers mechanistic insight into RLHF-organized behavior and a practical toolkit for precision template-layer editing.
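The abstract does not spell out how the two forward passes are used, so the following is only a minimal sketch on a toy single-layer ReLU FFN, not the authors' implementation. It assumes that the first pass is a clean run and the second injects a direction into the residual input, that candidate neurons are ranked by how much their contribution changes between the passes, and that the FFN-to-skip signal ratio is the norm of the change in FFN output divided by the norm of the change carried by the skip path. The names `probe`, `scale_neurons`, `alpha`, and `top_k` are hypothetical.

```python
import numpy as np

def ffn_hidden(x, w_in):
    """Hidden activations of a toy ReLU FFN (one value per neuron)."""
    return np.maximum(x @ w_in, 0.0)

def probe(x, direction, w_in, w_out, alpha=1.0, top_k=5):
    """Two forward passes: clean, then with `direction` injected into the residual input."""
    delta_x = alpha * direction                # change carried by the skip path
    h_clean = ffn_hidden(x, w_in)              # pass 1: clean
    h_pert = ffn_hidden(x + delta_x, w_in)     # pass 2: perturbed
    delta_h = h_pert - h_clean
    delta_ffn = delta_h @ w_out                # change in FFN output
    # ASSUMED definition of the FFN-to-skip signal ratio (not given in the abstract)
    ratio = np.linalg.norm(delta_ffn) / np.linalg.norm(delta_x)
    # ASSUMED ranking: activation change weighted by the neuron's output-weight norm
    scores = np.abs(delta_h) * np.linalg.norm(w_out, axis=1)
    candidates = np.argsort(scores)[::-1][:top_k]
    return ratio, candidates

def scale_neurons(x, w_in, w_out, neurons, factor):
    """FFN output with chosen hidden neurons scaled (0.0 ablates, >1.0 amplifies)."""
    h = ffn_hidden(x, w_in)
    h[neurons] *= factor
    return h @ w_out

# Toy setup with random weights; real use would hook an actual model's FFN layer.
rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
w_in = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)
w_out = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)
x = rng.normal(size=d_model)
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

ratio, candidates = probe(x, direction, w_in, w_out)
ablated = scale_neurons(x, w_in, w_out, candidates, 0.0)
print(f"FFN-to-skip ratio: {ratio:.3f}; candidate neurons: {candidates}")
```

In this sketch the ranking only produces hypotheses. The intervention sweep the abstract describes would then test them, for example by calling `scale_neurons` with `factor=0.0` (ablation) or a factor above 1 (amplification) and checking whether the behavior actually changes.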

Problem

Research questions and friction points this paper is trying to address.

FFN behavioral circuits

perturbation probing

RLHF

circuit topology

aligned LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Perturbation Probing

FFN Behavioral Circuits

Opposition Circuits