🤖 AI Summary
This work argues that steering, a technique that guides model behavior at inference time by intervening on internal activations, should be treated as a form of language model adaptation and analyzed alongside established methods. Noting that steering is rarely examined within the same conceptual framework as fine-tuning, parameter-efficient adaptation, and prompting, the authors introduce a set of functional criteria for adaptation methods and use them to compare steering with these classical alternatives. The comparison positions steering as a distinct paradigm based on targeted interventions in activation space, which enables local and reversible behavioral change without parameter updates. The resulting framing clarifies how steering relates to existing methods and motivates a unified taxonomy for model adaptation.
📝 Abstract
Post-training adaptation of language models is commonly achieved through parameter updates or input-based methods such as fine-tuning, parameter-efficient adaptation, and prompting. In parallel, a growing body of work modifies internal activations at inference time to influence model behavior, an approach known as steering. Despite its increasing use, steering is rarely analyzed within the same conceptual framework as established adaptation methods.
In this work, we argue that steering should be regarded as a form of model adaptation. We introduce a set of functional criteria for adaptation methods and use them to compare steering approaches with classical alternatives. This analysis positions steering as a distinct adaptation paradigm based on targeted interventions in activation space, enabling local and reversible behavioral change without parameter updates. The resulting framing clarifies how steering relates to existing methods, motivating a unified taxonomy for model adaptation.
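As a rough illustration of what an intervention in activation space can look like, the toy sketch below adds a fixed vector to the hidden activation of a tiny two-layer network. It is not the method analyzed in this work: the network, the weights, and the names `forward`, `steer`, and `alpha` are all hypothetical, chosen only to show that the output changes while the weights stay untouched, and that dropping the intervention restores the original behavior.

```python
# Toy sketch of activation steering in pure Python.
# Everything here (weights, names, sizes) is an illustrative assumption.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def forward(W1, W2, x, steer=None, alpha=1.0):
    """Run the toy model; optionally add alpha * steer to the hidden activation."""
    h = relu(matvec(W1, x))
    if steer is not None:
        # The intervention: shift the activation, leave W1 and W2 untouched.
        h = [hi + alpha * si for hi, si in zip(h, steer)]
    return matvec(W2, h)

W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [[1.0, -1.0, 0.5]]
W1_before = [row[:] for row in W1]   # copies, to check the weights never change
W2_before = [row[:] for row in W2]

x = [1.0, 2.0]
steer = [0.0, 0.0, 2.0]              # hypothetical steering direction

baseline = forward(W1, W2, x)                          # no intervention
steered = forward(W1, W2, x, steer=steer, alpha=1.0)   # with intervention
restored = forward(W1, W2, x)                          # intervention removed

print(baseline, steered, restored)
```

In this toy setting the intervention touches a single hidden layer (local), is undone by simply omitting `steer` (reversible), and leaves `W1` and `W2` identical before and after (no parameter updates). Real steering methods differ in how the vector is obtained and where it is applied.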