COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics

📅 2026-03-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing activation steering methods face a fundamental trade-off between sample efficiency and signal extraction capability, making precise control with few examples challenging. This work proposes a training-free, inference-time activation steering framework that dynamically guides large language model behavior by approximating—in a single step—the representational changes induced by gradient descent over in-context examples. The approach integrates unit kernel approximation, which updates activations directly via normalized gradients, with finite difference approximation, which simulates multi-example learning using only two forward passes. This significantly reduces reliance on example quantity. Experiments demonstrate that the method achieves up to 95% effectiveness across diverse steering tasks, requiring 50 times fewer examples than the strongest baseline, and exhibits flexible adaptation to heterogeneous human preferences in multi-objective alignment scenarios.
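As a rough illustration of the first idea above (updating activations directly with normalized gradients), here is a minimal NumPy sketch on a toy linear readout. Everything in it is an assumption made for illustration, not the paper's implementation: the function names `steer` and `grad_wrt_activation`, the cross-entropy loss, the per-example L2 normalization, and the step size `alpha` are all hypothetical choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def ce_loss(W, h, y):
    """Cross-entropy of target token y under logits W @ h (toy readout)."""
    return -np.log(softmax(W @ h)[y])

def grad_wrt_activation(W, h, y):
    """Gradient of the cross-entropy loss with respect to the activation h."""
    p = softmax(W @ h)
    p[y] -= 1.0              # softmax(W h) - onehot(y)
    return W.T @ p

def steer(h_query, W, example_acts, example_targets, alpha=0.1):
    """Shift h_query against the mean of per-example unit-norm gradients.

    Assumption: each example's gradient is L2-normalized before averaging,
    so no single example dominates the steering direction.
    """
    dirs = []
    for h, y in zip(example_acts, example_targets):
        g = grad_wrt_activation(W, h, y)
        dirs.append(g / (np.linalg.norm(g) + 1e-12))
    return h_query - alpha * np.mean(dirs, axis=0)

# Hypothetical toy setup: 8-dim activations, 5-token vocabulary.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))
example_acts = [rng.normal(size=8) for _ in range(4)]
example_targets = [2, 2, 2, 2]   # every example prefers token 2
h_query = rng.normal(size=8)
h_steered = steer(h_query, W, example_acts, example_targets)
```

Because the averaged directions are unit vectors, the shift applied to `h_query` never exceeds `alpha` in norm, which keeps the size of the intervention controlled by a single parameter in this sketch.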

📝 Abstract

Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade-off: sample-efficient methods suboptimally capture steering signals from labeled examples, while methods that better extract these signals require hundreds to thousands of examples. We introduce COLD-Steer, a training-free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in-context examples. Our key insight is that the effect of fine-tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite-difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples than the best baseline. COLD-Steer facilitates accommodating diverse perspectives without extensive demonstration data, which we validate through our experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context-aware model control that can flexibly address varying loss-driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.
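To make the "two forward passes regardless of example count" idea concrete, here is a generic finite-difference sketch in NumPy. It is one plausible reading, not the paper's algorithm: it assumes an aggregated parameter gradient `g` over the examples is already available, and it estimates the first-order change in a query activation after one hypothetical gradient step by evaluating the layer at the original and at slightly perturbed parameters. The toy `tanh` layer, the squared-error loss, and the names `forward`, `aggregated_param_grad`, `fd_activation_shift`, `lr`, and `eps` are all illustrative assumptions.

```python
import numpy as np

def forward(A, x):
    """Toy 'layer' whose output plays the role of the steered activation."""
    return np.tanh(A @ x)

def aggregated_param_grad(A, xs, targets):
    """Gradient of sum_i 0.5 * ||forward(A, x_i) - t_i||^2 with respect to A."""
    g = np.zeros_like(A)
    for x, t in zip(xs, targets):
        h = forward(A, x)
        g += np.outer((h - t) * (1.0 - h**2), x)
    return g

def fd_activation_shift(A, g, x_query, lr=0.01, eps=1e-6):
    """First-order activation change for the step A <- A - lr * g.

    Uses exactly two forward passes on the query, however many
    examples were folded into g.
    """
    h0 = forward(A, x_query)             # pass 1: original parameters
    h1 = forward(A + eps * g, x_query)   # pass 2: perturbed along g
    return -lr * (h1 - h0) / eps

# Hypothetical toy data: 6 examples, 3-dim inputs, 4-dim activations.
rng = np.random.default_rng(0)
A = 0.5 * rng.normal(size=(4, 3))
xs = [0.5 * rng.normal(size=3) for _ in range(6)]
targets = [np.tanh(rng.normal(size=4)) for _ in range(6)]
x_query = 0.5 * rng.normal(size=3)

g = aggregated_param_grad(A, xs, targets)
delta_fd = fd_activation_shift(A, g, x_query)

# Analytic Jacobian-vector product for this toy layer, as a reference.
h_query = forward(A, x_query)
delta_jvp = -0.01 * (1.0 - h_query**2) * (g @ x_query)

h_steered = h_query + delta_fd   # parameters A are never modified
```

In this sketch the parameters stay fixed and only the activation is shifted, which mirrors the abstract's point that the effect of fine-tuning is approximated without actual parameter updates.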

Problem

Research questions and friction points this paper is trying to address.

activation steering

large language models

in-context learning

sample efficiency

learning dynamics

Innovation

Methods, ideas, or system contributions that make the work stand out.

activation steering

in-context learning

gradient approximation