Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces

📅 2025-07-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the geometric structure of semantic representations in the latent space of large language models (LLMs). It remains unclear how semantic information is geometrically organized across layers and models, particularly under complex reasoning conditions. Method: Through large-scale empirical analysis of hidden states across multiple LLMs and Transformer decoder layers, we identify that high-level semantics concentrate densely in low-dimensional linear subspaces and exhibit cross-domain linear separability—enhanced markedly in deeper layers and under structured reasoning prompts (e.g., chain-of-thought). We propose *Semantic Direction Encoding*, wherein intricate reasoning patterns are represented as simple directional vectors in latent space. Building on this, we design a geometry-aware intervention: a lightweight MLP trained as a latent-space “guardrail” to detect adversarial and malicious prompts. Results: Our method achieves high accuracy on detection benchmarks, validating a novel implicit control paradigm grounded in subspace geometry.
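The summary's *Semantic Direction Encoding* idea, representing a reasoning pattern such as chain-of-thought as a single directional vector in latent space, can be sketched as a difference of mean hidden states between two prompt conditions. This is a minimal illustration under assumed inputs, not the paper's exact procedure; the function names and the difference-of-means estimator are assumptions.

```python
import numpy as np

def semantic_direction(h_a, h_b):
    """Unit direction from condition-B toward condition-A hidden states.

    h_a, h_b: (n_samples, d_model) hidden states at one decoder layer,
    e.g. prompts with vs. without a chain-of-thought instruction
    (hypothetical setup; the paper's extraction details are not shown here).
    """
    direction = h_a.mean(axis=0) - h_b.mean(axis=0)
    return direction / np.linalg.norm(direction)

def project(h, direction):
    """Coordinate of each hidden state along the semantic direction."""
    return h @ direction
```

Once such a direction is estimated, projecting hidden states onto it (or adding and subtracting it) supports the kind of simple causal intervention in hidden space that the summary describes.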

📝 Abstract
Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. However, it remains unclear to what extent LLMs internally organize representations related to semantic understanding. To investigate this, we conduct a large-scale empirical study of hidden states in transformer-based LLMs, analyzing 11 decoder-only models across 6 scientific topics and 12 layers each. We find that high-level semantic information consistently lies in low-dimensional subspaces that form linearly separable representations across distinct domains. This separability becomes more pronounced in deeper layers and under prompts that trigger structured reasoning or alignment behaviors, even when surface content is unchanged. This geometry enables simple yet effective causal interventions in hidden space; for example, reasoning patterns like chain-of-thought can be captured by a single vector direction. Together, these findings support the development of geometry-aware tools that operate directly on latent representations to detect and mitigate harmful or adversarial content, using methods such as transport-based defenses that leverage this separability. As a proof of concept, we demonstrate this potential by training a simple MLP classifier as a lightweight latent-space guardrail, which detects adversarial and malicious prompts with high precision.
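The latent-space guardrail described above can be illustrated with a toy sketch: a one-hidden-layer MLP, written in plain NumPy, trained to separate two synthetic "prompt" populations whose hidden states differ only inside a small subspace. The data, dimensions, and training loop are assumptions for illustration; the paper's actual classifier architecture, features, and benchmarks are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for per-prompt hidden states: "adversarial" prompts
# differ from "benign" ones only inside a low-dimensional subspace.
d_model = 32
benign = rng.normal(size=(200, d_model))
adversarial = rng.normal(size=(200, d_model))
adversarial[:, :4] += 2.0  # hypothetical separable directions

X = np.vstack([benign, adversarial])
y = np.array([0.0] * 200 + [1.0] * 200)

# One-hidden-layer MLP guardrail, trained with full-batch gradient
# descent on binary cross-entropy.
hidden = 16
W1 = rng.normal(scale=0.1, size=(d_model, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=hidden)
b2 = 0.0

def forward(X):
    a = np.maximum(X @ W1 + b1, 0.0)           # ReLU hidden layer
    p = 1.0 / (1.0 + np.exp(-(a @ W2 + b2)))   # sigmoid "adversarial" score
    return a, p

lr = 0.1
for _ in range(1000):
    a, p = forward(X)
    g = (p - y) / len(y)                       # dBCE/dlogit
    ga = np.outer(g, W2) * (a > 0)             # backprop through ReLU
    W2 -= lr * (a.T @ g)
    b2 -= lr * g.sum()
    W1 -= lr * (X.T @ ga)
    b1 -= lr * ga.sum(axis=0)

_, p = forward(X)
accuracy = float(((p > 0.5) == (y > 0.5)).mean())
```

Because the two populations are well separated inside the shifted subspace, even this small classifier reaches high accuracy on the toy data, which is the geometric intuition behind the guardrail.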
Problem

Research questions and friction points this paper is trying to address.

Understand how LLMs organize semantic representations internally
Characterize the low-dimensional subspaces in which semantics become linearly separable
Develop geometry-aware tools that detect harmful or adversarial content
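The subspace question raised above can be probed with ordinary PCA: if semantics concentrate in a low-dimensional linear subspace, projecting hidden states onto their top principal components should preserve cross-domain separation almost unchanged. The sketch below checks this on hypothetical data; the domain setup and dimensions are assumptions, not the paper's.

```python
import numpy as np

def top_k_subspace(H, k):
    """Orthonormal basis (d, k) of the top-k principal subspace of H (n, d)."""
    Hc = H - H.mean(axis=0)
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Vt[:k].T

# Hypothetical hidden states from two topic domains whose difference
# lives in a few coordinates (a low-dimensional subspace).
rng = np.random.default_rng(1)
d_model, k = 64, 4
domain_a = rng.normal(size=(100, d_model))
domain_b = rng.normal(size=(100, d_model))
domain_b[:, :k] += 3.0

basis = top_k_subspace(np.vstack([domain_a, domain_b]), k)
centroid_gap = domain_a.mean(0) - domain_b.mean(0)
gap_full = np.linalg.norm(centroid_gap)
gap_sub = np.linalg.norm(centroid_gap @ basis)
# If semantics are subspace-concentrated, the centroid gap survives
# projection to the k-dimensional subspace almost unchanged.
```

The paper's finding corresponds to `gap_sub` staying close to `gap_full` for small `k`, i.e. the directions that separate domains are captured by a handful of principal components.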
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-dimensional linear subspaces encode high-level semantic information
Linear separability strengthens in deeper layers and under structured reasoning prompts
A geometry-aware latent-space guardrail detects adversarial content