Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

📅 2025-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Controlling multiple behavioral attributes simultaneously at inference time in large language models (LLMs) remains challenging: linear steering assumes that interventions combine additively in activation space, attributes interfere with one another, and each attribute needs its own stored and tuned direction vector. Method: We propose K-Steering, a non-linear framework that requires no fine-tuning of the LLM. It replaces additive linear interventions with a single multi-label classifier trained on hidden activations, and at inference time it computes a composite intervention direction from the gradient of the classifier's output with respect to those activations. Contributions/Results: We introduce two new benchmarks, ToneBank and DebateMix, that target compositional behavioral control. Across three model families, evaluated with both activation-based classifiers and LLM-based judges, K-Steering outperforms strong baselines at steering multiple behaviors jointly.

📝 Abstract

Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in activation space and require per-attribute tuning. We introduce K-Steering, a unified and flexible approach that trains a single non-linear multi-label classifier on hidden activations and computes intervention directions via gradients at inference time. This avoids linearity assumptions, removes the need for storing and tuning separate attribute vectors, and allows dynamic composition of behaviors without retraining. To evaluate our method, we propose two new benchmarks, ToneBank and DebateMix, targeting compositional behavioral control. Empirical results across 3 model families, validated by both activation-based classifiers and LLM-based judges, demonstrate that K-Steering outperforms strong baselines in accurately steering multiple behaviors.
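The gradient-based step described in the abstract can be illustrated with a small, self-contained sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the two-layer tanh classifier, the loss (avoided-attribute logits minus target-attribute logits), the step size `alpha`, the number of steps, and all function names. The classifier weights are random rather than trained, and gradients are derived by hand so the example needs only the standard library; in practice an autodiff framework would supply them.

```python
import math
import random


def forward(a, params):
    """Run the classifier: activation -> (hidden units, one logit per attribute)."""
    W1, b1, W2, b2 = params
    h = [math.tanh(sum(w * x for w, x in zip(row, a)) + b) for row, b in zip(W1, b1)]
    z = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W2, b2)]
    return h, z


def steering_loss(a, params, target, avoid):
    """Assumed loss: avoided-attribute logits minus target-attribute logits."""
    _, z = forward(a, params)
    return sum(z[k] for k in avoid) - sum(z[k] for k in target)


def loss_gradient(a, params, target, avoid):
    """Backpropagate the steering loss through the classifier to the activation."""
    W1, _, W2, _ = params
    h, _ = forward(a, params)
    dz = [0.0] * len(W2)
    for k in target:
        dz[k] -= 1.0
    for k in avoid:
        dz[k] += 1.0
    # Chain rule: logits -> hidden units -> tanh pre-activations -> input activation.
    dh = [sum(W2[k][j] * dz[k] for k in range(len(W2))) for j in range(len(h))]
    dpre = [dh[j] * (1.0 - h[j] ** 2) for j in range(len(h))]
    return [sum(W1[j][i] * dpre[j] for j in range(len(W1))) for i in range(len(a))]


def steer(a, params, target, avoid, alpha=0.05, steps=3):
    """Take a few gradient-descent steps on the activation itself."""
    for _ in range(steps):
        grad = loss_gradient(a, params, target, avoid)
        a = [x - alpha * g for x, g in zip(a, grad)]
    return a


def random_params(dim, hidden, n_attrs, seed=0):
    """Random stand-in weights; a real classifier would be trained on activations."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return matrix(hidden, dim), [0.0] * hidden, matrix(n_attrs, hidden), [0.0] * n_attrs


params = random_params(dim=4, hidden=8, n_attrs=3)
activation = [0.2, -0.1, 0.4, 0.0]
# Steer toward attributes 0 and 2 and away from attribute 1 in a single pass.
steered = steer(activation, params, target=[0, 2], avoid=[1])
print(steering_loss(activation, params, [0, 2], [1]))
print(steering_loss(steered, params, [0, 2], [1]))
```

Because the direction is recomputed from the current activation and the requested attribute set, changing `target` or `avoid` composes a different behavior without retraining the classifier or storing a separate vector per attribute.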

Problem

Research questions and friction points this paper is trying to address.

Control multiple behavioral attributes in LLMs

Overcome interference between attributes and linear limitations

Enable dynamic behavior composition without retraining
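
For reference, the additive assumption behind linear steering can be written as below. The notation is ours and is not taken from the paper: $a$ is a hidden activation, $v_k$ is the fixed direction stored for attribute $k$, and $\alpha_k$ is its separately tuned strength.

```latex
a' = a + \sum_{k=1}^{K} \alpha_k \, v_k
```

Each $v_k$ is independent of both the current activation and the other attributes being steered, which is one way to see why combined interventions can interfere and why every attribute needs its own vector and tuning.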

Innovation

Methods, ideas, or system contributions that make the work stand out.

Non-linear multi-label classifier for hidden activations

Gradient-based intervention directions at inference

Dynamic behavior composition without retraining
