Compositional Steering of Large Language Models with Steering Tokens

📅 2026-01-08

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing steering methods largely target a single behavior and struggle with compositional control over multiple behaviors in large language models. This work proposes compositional steering tokens: each behavior, expressed as a natural language instruction, is distilled into a dedicated input token via self-distillation, and a separate composition token is trained on pairs of behaviors. Because the tokens live in the input token space rather than the activation space, they compose zero-shot and generalize to unseen combinations, unseen behaviors, and unseen numbers of behaviors. Across several LLM architectures, steering tokens give better multi-behavior control than instruction prompting, activation steering, and LoRA merging. Combining the tokens with natural language instructions yields further gains.

📝 Abstract

Deploying LLMs in real-world applications requires controllable output that satisfies multiple desiderata at the same time. While existing work extensively addresses LLM steering for a single behavior, compositional steering (i.e., steering LLMs simultaneously towards multiple behaviors) remains an underexplored problem. In this work, we propose compositional steering tokens for multi-behavior steering. We first embed individual behaviors, expressed as natural language instructions, into dedicated tokens via self-distillation. Contrary to most prior work, which operates in the activation space, our behavior steers live in the space of input tokens, enabling more effective zero-shot composition. We then train a dedicated composition token on pairs of behaviors and show that it successfully captures the notion of composition: it generalizes well to unseen compositions, including those with unseen behaviors as well as those with an unseen number of behaviors. Our experiments across different LLM architectures show that steering tokens lead to superior multi-behavior control compared to competing approaches (instructions, activation steering, and LoRA merging). Moreover, we show that steering tokens complement natural language instructions, with their combination resulting in further gains.
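
To make the idea of steering in the input token space concrete, here is a minimal sketch of how behavior tokens and a composition token might be assembled in front of a prompt. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_steered_input`, the behavior names, the toy 2-dimensional vectors, and the layout (behavior tokens joined by the composition token, placed before the prompt) are all hypothetical. In the paper the token embeddings are learned; here they are hard-coded.

```python
from typing import Dict, List

Vector = List[float]


def build_steered_input(
    prompt_embeddings: List[Vector],
    behavior_tokens: Dict[str, Vector],
    composition_token: Vector,
    behaviors: List[str],
) -> List[Vector]:
    """Prepend steering-token embeddings to a prompt's input embeddings.

    Assumed layout (hypothetical, not taken from the paper):
        [b_1, <comp>, b_2, <comp>, ..., b_k, prompt...]
    """
    prefix: List[Vector] = []
    for i, name in enumerate(behaviors):
        if i > 0:
            # The same composition token joins every adjacent pair,
            # so the scheme extends to any number of behaviors.
            prefix.append(composition_token)
        prefix.append(behavior_tokens[name])
    return prefix + prompt_embeddings


# Toy embeddings; real ones would be trained (e.g. via self-distillation).
behavior_tokens = {
    "formal": [1.0, 0.0],
    "concise": [0.0, 1.0],
    "french": [1.0, 1.0],
}
composition_token = [0.5, 0.5]
prompt = [[0.1, 0.2], [0.3, 0.4]]

two = build_steered_input(
    prompt, behavior_tokens, composition_token, ["formal", "concise"]
)
three = build_steered_input(
    prompt, behavior_tokens, composition_token, ["formal", "concise", "french"]
)
```

Under this assumed layout, moving from two to three behaviors only lengthens the prefix and reuses the same composition token, which is one way to picture the abstract's claim of generalizing to an unseen number of behaviors.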

Problem

Research questions and friction points this paper is trying to address.

compositional steering

large language models

multi-behavior control

steering tokens

Innovation

Methods, ideas, or system contributions that make the work stand out.

compositional steering

steering tokens

self-distillation

zero-shot composition