🤖 AI Summary
This work addresses a key limitation in controllable language modeling: existing methods perform only pointwise interventions and fail to capture the inherent distributional characteristics of semantic concepts. We propose the first distribution-level controllable intervention paradigm. Instead of adjusting a single representation point within a concept subspace, our method jointly models the statistical distribution of the subspace and its neighborhood (e.g., its variance), enabling distributional transformation via learnable representation fine-tuning. This strategy proves especially effective in early Transformer layers, significantly enhancing both behavioral guidance fidelity and robustness during forward inference. Evaluated on eight commonsense reasoning and seven arithmetic reasoning benchmarks, our approach consistently outperforms state-of-the-art pointwise intervention methods. These results empirically validate that explicit distributional modeling is critical for improving controllability, establishing a foundation for principled, distribution-aware intervention in large language models.
📝 Abstract
Interventions in language models (LMs) are applied strategically to steer model behavior during the forward pass. Learnable interventions, also known as representation fine-tuning, apply pointwise control within a concept subspace and have proven effective in altering high-level behaviors. In this work, we extend this approach to the distribution level, enabling the model to learn not only pointwise transformations but also transformations of the surrounding regions of the concept subspace. We demonstrate that these methods perform effectively in early layers, with larger standard deviations correlating strongly with improved performance. Across eight commonsense reasoning and seven arithmetic reasoning benchmarks, our distribution-wise interventions consistently outperform pointwise interventions in controllability and robustness. These results illustrate that distribution-wise interventions provide a more comprehensive method for steering model behavior and enable finer-grained control over language models. The code is available at: https://github.com/chili-lab/D-Intervention.
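To make the pointwise-versus-distribution-wise distinction concrete, the sketch below shows one plausible way such an intervention could be implemented. All names (`DistributionIntervention`, `rank`, `init_std`) are hypothetical: the paper itself defines the exact parameterization. A pointwise learnable intervention (ReFT-style) edits the hidden state toward a learned target inside a low-rank concept subspace; a distribution-wise variant, as we understand the abstract, additionally perturbs that target with a learnable standard deviation during training so the edit covers a neighborhood of the subspace rather than a single point.

```python
import torch


class DistributionIntervention(torch.nn.Module):
    """Hypothetical sketch of a distribution-wise intervention.

    Pointwise ReFT-style edit:  h + R^T (W h + b - R h)
    Distribution-wise variant:  perturb the subspace target with learnable
    noise (std = exp(log_std)) during training, covering a neighborhood
    of the concept subspace instead of a single point.
    """

    def __init__(self, hidden_dim: int, rank: int, init_std: float = 0.1):
        super().__init__()
        # Low-rank concept subspace R (rank x hidden_dim) -- assumed parameterization.
        self.R = torch.nn.Parameter(torch.randn(rank, hidden_dim) * 0.02)
        # Learned projection to the subspace target.
        self.W = torch.nn.Linear(hidden_dim, rank)
        # Learnable log-std controlling the size of the sampled neighborhood.
        self.log_std = torch.nn.Parameter(
            torch.full((rank,), torch.log(torch.tensor(init_std)).item())
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Current position of h inside the concept subspace.
        proj = h @ self.R.T                      # (batch, rank)
        # Learned pointwise target in the subspace.
        target = self.W(h)                       # (batch, rank)
        if self.training:
            # Distribution-wise step: sample around the target so the
            # intervention is learned over a region, not a single point.
            target = target + torch.exp(self.log_std) * torch.randn_like(target)
        # Move h toward the (possibly sampled) target along the subspace.
        return h + (target - proj) @ self.R      # (batch, hidden_dim)
```

In evaluation mode (`module.eval()`) the sketch reduces to the pointwise edit, so the sampling only shapes training; this is one design choice consistent with the abstract's claim that larger standard deviations improve performance, not necessarily the authors' exact mechanism.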