End-to-end Learning of Sparse Interventions on Activations to Steer Generation

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To jointly achieve controllability and efficiency in generative models, this paper proposes LinEAS: a framework of sparse, end-to-end learnable linear interventions applied directly at activation layers. Its core innovation is a global, loss-driven mechanism for automatic neuron- and layer-level selection, combining an inter-layer distribution-alignment loss, L1/L0 sparsity regularization, and few-shot calibration in activation space. LinEAS supports intervention composition and cross-task transfer, improving robustness and overcoming the limitations of conventional local interventions. Experiments show that, using only a handful of samples, LinEAS substantially outperforms existing intervention methods on text detoxification and text-to-image style control, achieving toxicity reduction on par with full-parameter fine-tuning. It also transfers to diffusion models such as Stable Diffusion, delivering low computational overhead together with high generation quality.

📝 Abstract
The growing use of generative models in daily life calls for efficient mechanisms to control their generation, to e.g., produce safe content or provide users with tools to explore style changes. Ideally, such mechanisms should be cheap, both at train and inference time, while preserving output quality. Recent research has shown that such mechanisms can be obtained by intervening exclusively on model activations, with the goal of correcting distributional differences between activations seen when using prompts from a source vs. a target set (e.g., toxic and non-toxic sentences). While cheap, these fast methods are inherently crude: their maps are tuned locally, not accounting for their impact on downstream layers, resulting in interventions that cause unintended shifts when used out-of-sample. We propose in this work linear end-to-end activation steering (LinEAS), an approach trained with a global loss that accounts simultaneously for all layerwise distributional shifts. In addition to being more robust, the loss used to train LinEAS can be regularized with sparsifying norms, which can automatically carry out neuron and layer selection. Empirically, LinEAS only requires a handful of samples to be effective, and beats similar baselines on toxicity mitigation, while performing on par with far more involved finetuning approaches. We show that LinEAS interventions can be composed, study the impact of sparsity on their performance, and showcase applications in text-to-image diffusions.
Problem

Research questions and friction points this paper is trying to address.

Control generative models to produce safe content.
Develop cheap, efficient mechanisms for model steering.
Address unintended shifts in model activations.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear end-to-end activation steering (LinEAS)
Global loss for layerwise distributional shifts
Sparsifying norms for neuron and layer selection
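The sparse linear intervention described above can be illustrated with a toy sketch. The example below (NumPy; the synthetic data, the per-neuron affine map a' = s·a + b, and the soft-thresholding step are illustrative assumptions, not the paper's actual implementation) fits a moment-matching affine map between "source" and "target" activation sets, then shrinks it toward the identity so that only neurons whose distributions genuinely differ keep a non-identity intervention:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations from a "source" and a "target" prompt set; only the
# first two neurons genuinely differ between the two distributions.
d, n = 8, 2000
src = rng.normal(0.0, 1.0, size=(n, d))
tgt = rng.normal(0.0, 1.0, size=(n, d))
tgt[:, :2] = tgt[:, :2] * 2.0 + 1.0

# Per-neuron affine map a' = s*a + b matching first and second moments
# (the 1D Gaussian optimal-transport map between the activation sets).
s = tgt.std(axis=0) / src.std(axis=0)
b = tgt.mean(axis=0) - s * src.mean(axis=0)

def shrink(x, t):
    """Soft-threshold: the proximal operator of an L1 penalty."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# Sparsify by shrinking toward the identity map (s=1, b=0), mimicking
# the effect of an L1 penalty on the intervention strength.
lam = 0.2
s = 1.0 + shrink(s - 1.0, lam)
b = shrink(b, lam)

active = np.flatnonzero((s != 1.0) | (b != 0.0))
print("neurons with non-identity intervention:", active)
```

With this setup the shrinkage zeroes out the sampling noise on the six untouched neurons, leaving only neurons 0 and 1 with a non-identity map. LinEAS additionally trains all such layerwise maps jointly under a single global loss, which this per-layer closed-form sketch does not capture.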
Authors
Pau Rodriguez (Apple)
Michal Klein (Apple; Machine Learning)
Eleonora Gualdoni (Apple)
Arno Blaas (Apple MLR; Machine Learning)
L. Zappella (Apple)
Marco Cuturi (Apple; machine learning, optimal transport)
Xavier Suau (Apple)