How Does Controllability Emerge In Language Models During Pretraining?

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how linear controllability over concepts such as sentiment and style emerges during language model pretraining. Addressing the lack of principled conditions for effective intervention and the reliance on heuristic trial-and-error, we propose the Intervention Detector (ID), a unified framework that systematically uncovers the co-evolution of linear controllability and linear separability in the latent space. Leveraging linear interventions on hidden states and ID-derived metrics, including heatmaps, entropy trends, and cosine similarity, we conduct cross-stage validation across multiple model families. We find that controllability increases markedly during mid-training and that distinct semantic concepts (e.g., anger vs. sadness) emerge asynchronously. Critically, ID metrics correlate strongly with actual generation-level control performance, providing an interpretable, dynamics-aware foundation for controllable text generation.
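The summary centers on linear interventions on hidden states. As a concrete illustration, here is a minimal sketch of the common difference-of-means steering recipe; the paper's exact ID procedure is not reproduced here, and the function names, dimensions, and numbers below are illustrative assumptions.

```python
# Minimal sketch of a linear hidden-state intervention (steering).
# Assumes the common difference-of-means recipe; the paper's exact ID
# procedure, function names, and dimensions here are illustrative only.
import numpy as np

def concept_direction(pos_states: np.ndarray, neg_states: np.ndarray) -> np.ndarray:
    """Estimate a unit concept direction from hidden states collected on
    concept-present (pos) vs. concept-absent (neg) prompts."""
    direction = pos_states.mean(axis=0) - neg_states.mean(axis=0)
    return direction / np.linalg.norm(direction)

def intervene(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along the concept direction; alpha sets strength."""
    return hidden + alpha * direction

# Toy usage with random stand-ins for a layer's activations (d_model = 8).
rng = np.random.default_rng(0)
pos = rng.normal(loc=0.5, size=(32, 8))   # e.g., states from "angry" prompts
neg = rng.normal(loc=-0.5, size=(32, 8))  # e.g., states from neutral prompts
v = concept_direction(pos, neg)
h = rng.normal(size=8)                    # one token's hidden state
print(intervene(h, v, alpha=2.0))
```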

📝 Abstract
Language models can be steered by modifying their internal representations to control concepts such as emotion, style, or truthfulness in generation. However, the conditions for an effective intervention remain unclear and are often validated through heuristics and trial-and-error. To fill this gap, we demonstrate that intervention efficacy, measured by linear steerability (i.e., the ability to adjust output via linear transformations of hidden states), emerges during intermediate stages of training. Moreover, even closely related concepts (e.g., anger and sadness) exhibit steerability emergence at distinct stages of training. To better interpret the dynamics of steerability during training, we adapt existing intervention techniques into a unified framework, referred to as the "Intervention Detector" (ID), which is designed to reveal how linear steerability evolves over the course of training through hidden state and representation analysis. ID reveals that concepts become increasingly linearly separable in the hidden space as training progresses, which strongly correlates with the emergence of linear steerability. We further introduce ID-based metrics, such as heatmaps, entropy trends, and cosine similarity, to help interpret how linear steerability evolves throughout training. In addition, we apply ID across different model families to ensure the generality of our findings on steerability dynamics.
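The abstract ties steerability to concepts becoming linearly separable in the hidden space as training progresses. The sketch below shows one common way to quantify that separability, a held-out linear probe fit on hidden states at successive checkpoints; the probe choice, checkpoint steps, and toy data are assumptions, not the paper's protocol.

```python
# Hedged sketch: probing linear separability of a concept in hidden states,
# the quantity the abstract says correlates with linear steerability.
# Checkpoint loading is mocked with toy data; the probe is an assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_separability(states: np.ndarray, labels: np.ndarray) -> float:
    """Held-out accuracy of a linear probe; higher = more linearly separable."""
    Xtr, Xte, ytr, yte = train_test_split(states, labels, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    return probe.score(Xte, yte)

# Toy "checkpoints": the class gap (hence separability) grows with training.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 64)
for step, gap in [(1_000, 0.1), (50_000, 0.6), (200_000, 1.5)]:
    means = np.where(labels[:, None] == 1, gap, -gap)
    states = rng.normal(size=(128, 16)) + means
    print(f"step {step}: probe accuracy = {linear_separability(states, labels):.2f}")
```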
Problem

Research questions and friction points this paper is trying to address.

How does linear controllability emerge in language models during pretraining?
Conditions for effective interventions on model representations remain unclear
Interpreting the dynamics of linear steerability across training stages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear steerability emerges during intermediate stages of training
The Intervention Detector (ID) framework analyzes hidden-state dynamics across checkpoints
ID metrics such as heatmaps, entropy trends, and cosine similarity track how steerability evolves (see the sketch after this list)
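As referenced above, here is a hedged sketch of two ID-style diagnostics named in the summary, an entropy trend over per-layer concept signal and checkpoint-to-checkpoint cosine similarity of concept directions; the exact ID definitions are not given on this page, so both formulas are illustrative assumptions.

```python
# Hedged sketch of two ID-style diagnostics: an entropy trend over
# per-layer concept signal, and cosine similarity between concept
# directions at successive checkpoints. Both formulas are assumptions,
# not the paper's exact definitions.
import numpy as np

def signal_entropy(layer_scores: np.ndarray) -> float:
    """Entropy of the (nonnegative) concept-signal distribution over layers;
    lower entropy means the signal is concentrated in a few layers."""
    p = layer_scores / layer_scores.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy numbers: late in training the signal concentrates in mid layers,
# and the concept direction stabilizes (checkpoint-to-checkpoint cosine ~ 1).
early = np.full(24, 1.0)        # diffuse signal across 24 layers
late = np.ones(24)
late[10:14] = 10.0              # concentrated in mid layers
print(f"entropy early={signal_entropy(early):.2f}, late={signal_entropy(late):.2f}")

rng = np.random.default_rng(2)
d1 = rng.normal(size=64)
d2 = d1 + 0.1 * rng.normal(size=64)   # slightly drifted direction
print(f"direction cosine across checkpoints = {cosine(d1, d2):.3f}")
```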