Scaling laws for activation steering with Llama 2 models and refusal mechanisms

📅 2025-07-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the scalability of Contrastive Activation Addition (CAA) across the Llama 2 family (7B/13B/70B), focusing on its diminishing efficacy in steering refusal behavior. Methodologically, we intervene in the residual stream via direction vectors derived from contrastive token pairs, systematically evaluating the impact of model scale, transformer layer position, and guidance direction (positive vs. negative). Results reveal: (i) CAA effectiveness degrades significantly with increasing model size; (ii) negative-direction guidance exhibits greater robustness than positive-direction guidance; and (iii) optimal intervention layers concentrate in the mid-to-early transformer blocks. To our knowledge, this is the first study to empirically uncover scaling laws and layer-wise biases in activation-based steering—providing foundational evidence and design insights for scalable alignment techniques in large language models.

📝 Abstract
As large language models (LLMs) evolve in complexity and capability, the efficacy of less widely deployed alignment techniques is uncertain. Building on previous work on activation steering and contrastive activation addition (CAA), this paper explores how the effectiveness of CAA changes with model scale, using the family of Llama 2 models (7B, 13B, and 70B). CAA works by finding desirable 'directions' in the model's residual stream vector space using contrastive pairs (for example, hate to love) and adding this direction to the residual stream during the forward pass. It directly manipulates the residual stream and aims to extract features from language models to better control their outputs. Using answer-matching questions centered on refusal behavior, we found that 1) CAA is most effective when applied at early-mid layers; 2) the effectiveness of CAA diminishes with model size; and 3) negative steering has more pronounced effects than positive steering across all model sizes.
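The core mechanism described in the abstract, deriving a steering vector from contrastive activation pairs and adding a scaled copy of it to the residual stream, can be sketched as follows. This is an illustrative toy in NumPy, not the authors' code: the function names, the 4-dimensional residual stream, and the activation values are all hypothetical.

```python
import numpy as np

def compute_steering_vector(pos_acts, neg_acts):
    """Mean difference between residual-stream activations of
    contrastive pairs (e.g. refusal vs. compliance completions)."""
    return np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)

def apply_caa(resid, steering_vector, multiplier):
    """Add the scaled steering vector to each token position's
    residual-stream activation at the chosen layer."""
    return resid + multiplier * steering_vector

# Toy example: d_model = 4, three contrastive activation pairs.
pos = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.9, 0.1, 0.0, 0.0],
                [1.1, -0.1, 0.0, 0.0]])
neg = np.array([[0.0, 1.0, 0.0, 0.0],
                [0.1, 0.9, 0.0, 0.0],
                [-0.1, 1.1, 0.0, 0.0]])

v = compute_steering_vector(pos, neg)     # direction from "neg" toward "pos"
resid = np.zeros((2, 4))                  # residual stream at 2 token positions
steered_pos = apply_caa(resid, v, +1.0)   # positive steering
steered_neg = apply_caa(resid, v, -1.0)   # negative steering
```

In a real run, `pos_acts` and `neg_acts` would be residual-stream activations captured at one layer of a Llama 2 model over many contrastive prompt pairs, and the multiplier's sign gives the positive vs. negative steering conditions compared in the paper.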
Problem

Research questions and friction points this paper is trying to address.

Effectiveness of activation steering with Llama 2 models
Impact of model scale on contrastive activation addition
Comparison of positive and negative steering effects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses contrastive activation addition (CAA) technique
Manipulates residual stream vector space
Applies CAA at early-mid layers
🔎 Similar Papers
Sheikh Abdur Raheem Ali
Trajectory Labs
Justin Xu
Stanford University
Ivory Yang
Dartmouth College
Machine Learning · NLP · Multilingual Models · Endangered Languages
Jasmine Xinze Li
Cornell University
Ayse Arslan
University of Oxford
Clark Benham
MATS