Linear representations in language models can change dramatically over a conversation

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the stability of linear representations of high-level semantic concepts in language models over the course of a dialogue, and the implications for interpretability and intervention methods. Using representation probing, directional steering, and cross-model dialogue replay, the work systematically analyzes how representations evolve across model layers and architectures. It finds that conversational context can substantially shift where information falls along semantic dimensions such as factuality, while generic, conversation-irrelevant information remains comparatively stable. Representational dynamics are also sensitive to role-play cued by the conversation, which directly affects the efficacy of steering interventions. These findings challenge the common assumption of static representations and motivate dynamic interpretability and context-aware intervention strategies.
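
To make the probing setup concrete, here is a minimal sketch, assuming a HuggingFace chat model, of how one might project hidden states onto a fixed linear direction at different points in a conversation. The model name, layer index, and the random stand-in probe vector are illustrative assumptions, not the paper's actual choices.

```python
# Sketch of direction probing across a conversation (illustrative only).
# Model name, layer index, and the random stand-in probe are assumptions;
# the paper trains real probes on real models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

layer = 16                                         # residual-stream layer to read
direction = torch.randn(model.config.hidden_size)  # stand-in for a trained probe
direction = direction / direction.norm()

def probe_score(messages):
    """Project the last token's hidden state at `layer` onto the probe direction."""
    ids = tok.apply_chat_template(messages, return_tensors="pt")
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    h = out.hidden_states[layer][0, -1].float()    # (hidden_size,)
    return (h @ direction).item()

# The same question, asked cold vs. after a fiction-flavored exchange.
cold = [{"role": "user", "content": "Is the Eiffel Tower in Paris?"}]
primed = [
    {"role": "user", "content": "Let's write a story set in a world where Paris was never built."},
    {"role": "assistant", "content": "Sure! In this world, the Eiffel Tower stands in Lyon."},
    {"role": "user", "content": "Is the Eiffel Tower in Paris?"},
]
print(probe_score(cold), probe_score(primed))
```

If the conversational framing moves the representation along the probe direction, the two scores will differ even though the final question is identical.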

📝 Abstract
Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how representations evolve along these dimensions within the context of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the beginning of a conversation can be represented as non-factual at the end and vice versa. These changes are content-dependent; while representations of conversation-relevant information may change, generic information is generally preserved. These changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and occur across different model families and layers of the model. These representation changes do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker from simply having a sci-fi story in context that is framed more explicitly as such. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve in response to the model playing a particular role that is cued by a conversation. Our findings may pose challenges for interpretability and steering -- in particular, they imply that it may be misleading to use static interpretations of features or directions, or probes that assume a particular range of features consistently corresponds to a particular ground-truth value. However, these types of representational dynamics also point to exciting new research directions for understanding how models adapt to context.
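
The steering experiments described in the abstract intervene by adding a direction to the model's activations during generation. Below is a minimal sketch of one common way to do this with a PyTorch forward hook; the Llama-style module path, layer index, and steering strength are assumptions, not details from the paper.

```python
# Sketch of activation steering with a forward hook (illustrative only).
# The module path `model.model.layers`, the layer index, and the steering
# strength are assumptions; the paper's actual setup may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

layer, alpha = 16, 4.0                             # where and how hard to steer
direction = torch.randn(model.config.hidden_size)  # stand-in for a real direction
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    # Decoder layers return a tuple; element 0 is the hidden states. Adding the
    # direction here nudges every token's residual stream at this layer.
    hidden = output[0] + alpha * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[layer].register_forward_hook(steering_hook)
try:
    messages = [{"role": "user", "content": "Is the Eiffel Tower in Paris?"}]
    ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    out = model.generate(ids, max_new_tokens=40)
    print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

Because the paper finds that steering effects vary across a conversation, the same hook applied at different dialogue turns could produce very different completions.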
Problem

Research questions and friction points this paper is trying to address.

linear representations
factuality
conversational dynamics
language model interpretability
context adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

representational dynamics
linear representations
factuality
context adaptation
language models