Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing inference-time alignment methods for large language models typically rely on open-loop activation interventions that ignore how perturbations propagate across layers and lack online feedback, which limits control efficacy. This work shows empirically that layer-wise Transformer dynamics are well approximated by locally linear models, so inference can be modeled as a linear time-varying system. Building on this insight, the authors adapt the Linear Quadratic Regulator (LQR) to compute closed-loop feedback controllers from layer-wise Jacobians, achieving fine-grained behavioral control without offline training. Coupled with an adaptive semantic setpoint signal, the approach outperforms baseline activation steering methods on toxicity suppression, truthfulness enhancement, refusal control, and arbitrary concept steering, with minimal computational overhead and theoretical bounds on setpoint tracking error.
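
For readers unfamiliar with the control-theoretic framing, the idea of treating layers as time steps can be written in textbook form. The block below is a generic sketch and not necessarily the paper's exact formulation: the symbols x_k (activation entering layer k), f_k (the k-th transformer block), u_k (steering input), r (semantic setpoint), and the weights Q, R, Q_f are illustrative names, and adding u_k to the block output is an assumption.

```latex
% Assumed setup: the steering input is added to the output of block k.
x_{k+1} = f_k(x_k) + u_k, \qquad k = 0, \dots, N-1

% Local linearization around a nominal (unsteered) trajectory \bar{x}_k,
% with \delta x_k = x_k - \bar{x}_k and A_k the layer-wise Jacobian:
\delta x_{k+1} \approx A_k \, \delta x_k + u_k,
\qquad A_k = \left. \frac{\partial f_k}{\partial x} \right|_{x = \bar{x}_k}

% Standard finite-horizon quadratic cost penalizing setpoint error and control effort:
J = \sum_{k=0}^{N-1} \Big[ (x_k - r)^\top Q \, (x_k - r) + u_k^\top R \, u_k \Big]
    + (x_N - r)^\top Q_f \, (x_N - r)
```

Because A_k changes from layer to layer, the linearized system is time-varying, which is why a finite-horizon, layer-indexed regulator is the natural fit here.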

📝 Abstract

Inference-time LLM alignment methods, particularly activation steering, offer an alternative to fine-tuning by directly modifying activations during generation. Existing methods, however, often rely on non-anticipative interventions that ignore how perturbations propagate through transformer layers and lack online error feedback, resulting in suboptimal, open-loop control. To address this, we show empirically that, despite the nonlinear structure of transformer blocks, layer-wise dynamics across multiple LLM architectures and scales are well-approximated by locally-linear models. Exploiting this property, we model LLM inference as a linear time-varying dynamical system and adapt the classical linear quadratic regulator to compute feedback controllers using layer-wise Jacobians, steering activations toward desired semantic setpoints in closed-loop with minimal computational overhead and no offline training. We also derive theoretical bounds on setpoint tracking error, enabling formal guarantees on steering performance. Using a novel adaptive semantic feature setpoint signal, our method yields robust, fine-grained behavior control across models, scales, and tasks, including state-of-the-art modulation of toxicity, truthfulness, refusal, and arbitrary concepts, surpassing baseline steering methods. Our code is available at: https://github.com/trustworthyrobotics/lqr-activation-steering
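
To make the closed-loop idea concrete, here is a minimal NumPy sketch of a finite-horizon, time-varying LQR computed by backward Riccati recursion. It is a textbook regulator, not the authors' implementation (their code is at the repository linked above). Several things are assumptions for illustration: the function names `lqr_gains` and `rollout`, the random near-identity matrices standing in for layer-wise Jacobians, the identity input matrices, the cost weights, and the simplification that the setpoint error itself follows the linear dynamics e_{k+1} = A_k e_k + B_k u_k.

```python
import numpy as np


def lqr_gains(A_list, B_list, Q, R, Q_final):
    """Backward Riccati recursion for a finite-horizon, time-varying LQR.

    Returns one feedback gain K_k per step, so that u_k = -K_k @ e_k.
    """
    P = Q_final
    gains = [None] * len(A_list)
    for k in reversed(range(len(A_list))):
        A, B = A_list[k], B_list[k]
        # K_k = (R + B^T P B)^{-1} B^T P A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # P_k = Q + A^T P (A - B K), the Riccati difference equation
        P = Q + A.T @ P @ (A - B @ K)
        P = 0.5 * (P + P.T)  # re-symmetrize against round-off
        gains[k] = K
    return gains


def rollout(A_list, B_list, e0, gains=None):
    """Propagate the setpoint error through all steps, with or without feedback."""
    e = e0.copy()
    for k in range(len(A_list)):
        if gains is None:
            u = np.zeros(B_list[k].shape[1])  # open loop: no intervention
        else:
            u = -gains[k] @ e  # closed loop: feedback on the current error
        e = A_list[k] @ e + B_list[k] @ u
    return e


# Toy stand-ins (assumptions): near-identity "Jacobians" mimic a residual stream.
rng = np.random.default_rng(0)
d, n_layers = 8, 6
A_list = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(n_layers)]
B_list = [np.eye(d) for _ in range(n_layers)]
Q, R, Q_final = np.eye(d), 0.1 * np.eye(d), 10.0 * np.eye(d)

gains = lqr_gains(A_list, B_list, Q, R, Q_final)
e0 = rng.standard_normal(d)  # initial deviation from the setpoint
open_loop_error = np.linalg.norm(rollout(A_list, B_list, e0))
closed_loop_error = np.linalg.norm(rollout(A_list, B_list, e0, gains))
print(f"open loop: {open_loop_error:.4f}  closed loop: {closed_loop_error:.6f}")
```

The gains are computed once in a backward pass and then applied layer by layer using the error actually observed at each step, which is what distinguishes this from an open-loop intervention that adds a fixed vector regardless of how earlier perturbations propagated. In a real model, the toy matrices would be replaced by Jacobians of each transformer block evaluated along the current forward pass.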

Problem

Research questions and friction points this paper is trying to address.

activation steering

inference-time alignment

transformer dynamics

closed-loop control

LLM alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

activation steering

linear quadratic regulator

local linearity