Towards Understanding Steering Strength

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how steering strength (the magnitude of representation-level interventions) affects large language model behavior during inference. Addressing the challenge of tuning intervention strength to balance behavioral control against model performance, the work introduces the first systematic theoretical framework integrating directional representation steering, output probability modeling, and cross-entropy analysis. Through theoretical derivation and empirical validation, it reveals a non-monotonic, nonlinear relationship between steering strength and both concept activation and output distribution. Experiments across eleven mainstream language models, ranging from small GPT variants to contemporary large architectures, demonstrate the generality of these findings, offering both theoretical grounding and practical guidance for controllable text generation.

📝 Abstract
A popular approach to post-training control of large language models (LLMs) is the steering of intermediate latent representations: one identifies a well-chosen direction depending on the task at hand and perturbs representations along this direction at inference time. While many proposals exist for picking this direction, considerably less is understood about how to choose the magnitude of the move, even though its importance is clear: too little and the intended behavior does not emerge, too much and the model's performance degrades beyond repair. In this work, we propose the first theoretical analysis of steering strength. We characterize its effect on next-token probability, presence of a concept, and cross-entropy, deriving precise qualitative laws governing these quantities. Our analysis reveals surprising behaviors, including non-monotonic effects of steering strength. We validate our theoretical predictions empirically on eleven language models, ranging from a small GPT architecture to modern models.
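The steering operation the abstract describes, shifting an intermediate representation along a chosen direction by a tunable magnitude, can be sketched as below. This is a minimal illustrative example, not the paper's actual setup: the activation, direction, and strength values are invented placeholders.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Shift a hidden representation along a unit steering direction.

    hidden:    (d,) activation vector from some intermediate layer
    direction: (d,) steering direction (normalized here for clarity)
    alpha:     steering strength -- the magnitude whose choice the paper analyzes
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

# Toy example: a 4-dimensional activation steered with increasing strength.
h = np.array([0.5, -1.0, 0.2, 0.0])
v = np.array([1.0, 0.0, 0.0, 0.0])  # hypothetical concept direction

for alpha in (0.0, 1.0, 4.0):
    steered = steer(h, v, alpha)
    print(f"alpha={alpha}: first coordinate = {steered[0]}")
```

In practice such a shift is applied inside the model (e.g. via a forward hook on a chosen layer) to every token's residual-stream activation; the paper's question is how the downstream output distribution responds as `alpha` grows.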
Problem

Research questions and friction points this paper is trying to address.

steering strength
large language models
latent representations
inference control
perturbation magnitude
Innovation

Methods, ideas, or system contributions that make the work stand out.

steering strength
latent representation steering
theoretical analysis
non-monotonic behavior
large language models