On the Predictive Power of Representation Dispersion in Language Models

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the relationship between language models’ predictive capability—measured by perplexity—and the dispersion of their latent-space representations, quantified as the average cosine distance among hidden vectors. To address the challenge of unsupervised model evaluation and optimization, we propose representation dispersion computed on unlabeled data as a proxy metric for downstream performance, enabling principled selection of optimal retrieval layers and guiding model selection and fine-tuning. We further introduce a repulsion-based training objective to explicitly enhance representation dispersion. Extensive experiments across multiple model families, including LLaMA and Qwen, and diverse text domains demonstrate a strong negative correlation between dispersion and perplexity; moreover, increasing dispersion consistently improves both in-domain and cross-domain language modeling performance. This study is the first to systematically establish the decisive role of latent-space geometric properties, specifically representation dispersion, in governing language modeling efficacy, thereby offering a novel paradigm for unsupervised model assessment and optimization.

📝 Abstract
We show that a language model's ability to predict text is tightly linked to the breadth of its embedding space: models that spread their contextual representations more widely tend to achieve lower perplexity. Concretely, we find that representation dispersion - the average pairwise cosine distance among hidden vectors - strongly and negatively correlates with perplexity across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia, news, scientific abstracts). Beyond illustrating this link, we show how dispersion can be leveraged for a range of practical tasks without requiring labeled data. First, measuring dispersion on unlabeled text allows us to predict downstream accuracy in new domains, offering a data-efficient tool for model selection. Next, we find that identifying layers with higher dispersion pinpoints the best representations for retrieval-based methods such as kNN-LM, bypassing exhaustive layer-by-layer searches. Finally, we integrate a simple push-away objective into training, which increases dispersion in both single-domain and cross-domain scenarios and directly improves perplexity in each.
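The abstract defines representation dispersion as the average pairwise cosine distance among hidden vectors. A minimal NumPy sketch of that metric (the function name and implementation details are ours, not taken from the paper):

```python
import numpy as np

def representation_dispersion(h: np.ndarray) -> float:
    """Average pairwise cosine distance (1 - cosine similarity)
    among the rows of h, an (n, d) matrix of hidden vectors."""
    x = h / np.linalg.norm(h, axis=1, keepdims=True)  # unit-normalize rows
    sims = x @ x.T                                    # (n, n) cosine similarities
    n = len(x)
    off_diag = sims[~np.eye(n, dtype=bool)]           # drop self-similarities
    return float(np.mean(1.0 - off_diag))

# Orthogonal vectors are maximally spread under this metric:
print(representation_dispersion(np.eye(3)))  # → 1.0
```

In practice the rows of `h` would be contextual representations extracted from a chosen layer of the model over a sample of unlabeled text.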
Problem

Research questions and friction points this paper is trying to address.

Link between representation dispersion and language model perplexity
Predict downstream accuracy using dispersion on unlabeled text
Improve perplexity via training with a push-away dispersion objective
Innovation

Methods, ideas, or system contributions that make the work stand out.

Representation dispersion predicts model perplexity
Dispersion optimizes retrieval-based methods efficiently
Push-away objective enhances cross-domain performance
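The push-away objective is only described at a high level on this page. One plausible form of such a repulsion term, penalizing mean pairwise cosine similarity so that minimizing it spreads representations apart, can be sketched as follows (`push_away_loss` is a hypothetical name; the exact loss in the paper may differ):

```python
import numpy as np

def push_away_loss(h: np.ndarray) -> float:
    """Hypothetical repulsion term: mean pairwise cosine similarity
    among the rows of h. Adding this to the language-modeling loss
    (with a small weight) would encourage higher dispersion."""
    x = h / np.linalg.norm(h, axis=1, keepdims=True)  # unit-normalize rows
    sims = x @ x.T                                    # (n, n) cosine similarities
    n = len(x)
    return float(np.mean(sims[~np.eye(n, dtype=bool)]))
```

In an actual training loop this term would be computed on a differentiable framework's tensors (e.g. PyTorch) so gradients flow back through the hidden states.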
Y
Yanhong Li
University of Chicago
M
Ming Li
University of Maryland
K
Karen Livescu
TTI-Chicago
speech and language processing, machine learning
J
Jiawei Zhou
Stony Brook University