The Origins of Representation Manifolds in Large Language Models

📅 2025-05-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates whether large language models' (LLMs) neural representations form continuous, multidimensional concept manifolds, in contrast to the classical assumption of sparse linear features. Addressing the core interpretability question of how embeddings map to human-understandable concepts, the paper proposes and empirically tests the claim that cosine distance in representation space approximates intrinsic geodesic distance on the underlying concept manifold, thereby linking geometric distance to semantic relatedness. Methodologically, it combines geometric analysis of text embeddings and neuron activations, manifold learning validation, comparative experiments with sparse autoencoders, and cosine-path modeling. The work reports, for the first time across multiple LLMs, statistically significant manifold structure in both embedding and activation spaces, with cosine distance robustly reflecting conceptual differences. The findings challenge the linear representational paradigm, establish a theoretical bridge between representation geometry and conceptual semantics, and offer a geometric framework for interpretable AI.
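A minimal sketch of the central claim, under illustrative assumptions (a synthetic one-dimensional concept manifold embedded in a 64-dimensional representation space; this is not the paper's actual experimental setup): if cosine distance tracks intrinsic geometry, it should vary closely with geodesic (arc-length) distance along the manifold.

```python
# Hypothetical illustration, not the paper's code: compare cosine distance with
# geodesic (arc-length) distance along a synthetic concept manifold.
import numpy as np

rng = np.random.default_rng(0)

# A smooth curve (the "concept manifold") embedded linearly into R^64.
t = np.linspace(0.0, 1.0, 200)                        # intrinsic coordinate
basis = rng.standard_normal((3, 64))                  # assumed embedding directions
curve = np.stack([np.cos(2 * t), np.sin(2 * t), t], axis=1) @ basis

# Geodesic distance from the first point: cumulative arc length along the curve.
segments = np.linalg.norm(np.diff(curve, axis=0), axis=1)
geodesic = np.concatenate([[0.0], np.cumsum(segments)])

# Cosine distance from the first point in representation space.
unit = curve / np.linalg.norm(curve, axis=1, keepdims=True)
cosine_dist = 1.0 - unit @ unit[0]

# If cosine distance reflects intrinsic geometry, the two should be strongly related.
print(f"corr(geodesic, cosine distance) = {np.corrcoef(geodesic, cosine_dist)[0, 1]:.3f}")
```

Running the same comparison on real text embeddings or token activations would substitute those representations for the synthetic curve.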

📝 Abstract
There is a large ongoing scientific effort in mechanistic interpretability to map embeddings and internal representations of AI systems into human-understandable concepts. A key element of this effort is the linear representation hypothesis, which posits that neural representations are sparse linear combinations of "almost-orthogonal" direction vectors, reflecting the presence or absence of different features. This model underpins the use of sparse autoencoders to recover features from representations. Moving towards a fuller model of features, in which neural representations could encode not just the presence but also a potentially continuous and multidimensional value for a feature, has been a subject of intense recent discourse. We describe why and how a feature might be represented as a manifold, demonstrating in particular that cosine similarity in representation space may encode the intrinsic geometry of a feature through shortest, on-manifold paths, potentially answering the question of how distance in representation space and relatedness in concept space could be connected. The critical assumptions and predictions of the theory are validated on text embeddings and token activations of large language models.
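The sparse autoencoders mentioned in the abstract are the baseline the manifold view moves beyond. A minimal sketch of such an autoencoder, with assumed dimensions and an L1 sparsity penalty (not the authors' implementation), is shown below.

```python
# Minimal sparse-autoencoder sketch under assumed hyperparameters:
# features are recovered as sparse, non-negative codes over learned directions.
import torch
import torch.nn as nn

d_model, d_dict, l1_coef = 256, 2048, 1e-3             # assumed sizes and penalty weight

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)       # activation -> feature code
        self.decoder = nn.Linear(d_dict, d_model)       # feature code -> reconstruction

    def forward(self, x):
        code = torch.relu(self.encoder(x))              # sparse, non-negative feature activations
        return self.decoder(code), code

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
activations = torch.randn(1024, d_model)                # stand-in for LLM activations

for _ in range(100):                                    # toy training loop
    recon, code = sae(activations)
    loss = ((recon - activations) ** 2).mean() + l1_coef * code.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```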
Problem

Research questions and friction points this paper is trying to address.

Understanding how neural representations encode human-interpretable features
Exploring continuous and multidimensional feature values in representation space
Validating the connection between representation distance and concept relatedness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrasts sparse autoencoder feature recovery with manifold-valued features
Neural representations encode continuous multidimensional feature values
Cosine similarity links representation space to concept space (see the sketch after this list)
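One way to probe that link, sketched below under assumptions not taken from the paper: approximate on-manifold geodesics with shortest paths on a k-nearest-neighbour graph of the embeddings, then check how closely those path lengths track direct cosine distances.

```python
# Illustrative sketch: k-NN graph geodesics vs. direct cosine distances.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def cosine_vs_graph_geodesic(embeddings: np.ndarray, k: int = 10) -> float:
    """Correlation between cosine distance and k-NN graph geodesic distance."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cosine_dist = 1.0 - unit @ unit.T

    # Symmetric k-NN graph weighted by cosine distance; Dijkstra for path lengths.
    graph = kneighbors_graph(unit, n_neighbors=k, mode="distance", metric="cosine")
    geodesic = shortest_path(graph, method="D", directed=False)

    finite = np.isfinite(geodesic)                      # drop disconnected pairs
    return np.corrcoef(cosine_dist[finite], geodesic[finite])[0, 1]

# Usage with stand-in data; real text or activation embeddings would go here.
points = np.random.default_rng(1).standard_normal((500, 64))
print(f"corr(cosine, graph geodesic) = {cosine_vs_graph_geodesic(points):.3f}")
```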