🤖 AI Summary
This study investigates how language models perceive and perform visually grounded tasks, such as wrapping fixed-width text, when given only sequences of text tokens. Through mechanistic analysis of Claude 3.5 Haiku, we find that the model encodes character counts in early layers on low-dimensional curved manifolds, transforms these manifolds geometrically through attention heads, and ultimately constructs a linear decision boundary for deciding where to break the line. The manifolds are discretized by sparse feature families analogous to biological place cells, so the work integrates geometric and feature-based perspectives to explain the model's perception and decision-making. Combining causal interventions, manifold analysis, attention head dissection, and visualization, we validate the manifold-based counting mechanism, achieve precise control over model behavior, and uncover visual-illusion-like token sequences that can reliably disrupt this capability.
📝 Abstract
Language models can perceive visual properties of text despite receiving only sequences of tokens. We mechanistically investigate how Claude 3.5 Haiku accomplishes one such task: linebreaking in fixed-width text. We find that character counts are represented on low-dimensional curved manifolds discretized by sparse feature families, analogous to biological place cells. Accurate predictions emerge from a sequence of geometric transformations: token lengths are accumulated into character count manifolds, attention heads twist these manifolds to estimate distance to the line boundary, and the decision to break the line is enabled by arranging estimates orthogonally to create a linear decision boundary. We validate our findings through causal interventions and discover visual illusions: character sequences that hijack the counting mechanism. Our work demonstrates the rich sensory processing of early layers, the intricacy of attention algorithms, and the importance of combining feature-based and geometric views of interpretability.
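To make the task concrete, the following minimal sketch spells out the arithmetic that fixed-width linebreaking requires: accumulate word lengths into a running character count, compute the distance to the line boundary, and break when the next word does not fit. It is an explicit reference computation, not the model's internal mechanism; the function name `wrap`, the whitespace-separated words standing in for tokens, and the single-space separator are all assumptions made for illustration.

```python
def wrap(words, line_width):
    """Greedy fixed-width wrapping (illustrative; names are hypothetical)."""
    lines = []
    current = []      # words on the line being built
    char_count = 0    # running character count for the current line

    for word in words:
        # A word that is not first on its line also needs a separating space.
        needed = len(word) + (1 if current else 0)
        # Distance from the current position to the line boundary.
        remaining = line_width - char_count

        if current and needed > remaining:
            # The next word does not fit: emit the line and start a new one.
            lines.append(" ".join(current))
            current = [word]
            char_count = len(word)
        else:
            current.append(word)
            char_count += needed

    if current:
        lines.append(" ".join(current))
    return lines


print(wrap("the quick brown fox jumps".split(), 10))
```

The comparison `needed > remaining` plays the role of the break decision described above: it depends only on the length of the next word and the estimated distance to the line boundary.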