🤖 AI Summary
This work investigates the relationship between a language model's predictive capability, measured by perplexity, and the dispersion of its latent-space representations, quantified as the average pairwise cosine distance among hidden vectors. To address the challenge of unsupervised model evaluation and optimization, we propose representation dispersion, computed on unlabeled data, as a proxy for downstream performance, enabling principled selection of retrieval layers and guiding model selection and fine-tuning. We further introduce a repulsion-based training objective that explicitly increases representation dispersion. Extensive experiments across multiple model families (including LLaMA and Qwen) and diverse text domains demonstrate a strong negative correlation between dispersion and perplexity; moreover, increasing dispersion consistently improves both in-domain and cross-domain language modeling performance. This study is the first to systematically link a geometric property of the latent space, representation dispersion, to language modeling efficacy, offering a new paradigm for unsupervised model assessment and optimization.
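The dispersion metric described above can be sketched directly from its definition. The snippet below is a minimal illustration, not the paper's implementation: it assumes hidden states have already been extracted as an `(n, d)` array and computes the mean cosine distance over all off-diagonal pairs.

```python
import numpy as np

def representation_dispersion(hidden: np.ndarray) -> float:
    """Average pairwise cosine distance among hidden vectors.

    hidden: (n, d) array of contextual representations, n >= 2.
    Returns a scalar in [0, 2]; higher means more spread-out vectors.
    """
    # Normalize rows to unit length so dot products are cosine similarities.
    h = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    sim = h @ h.T  # (n, n) cosine similarity matrix
    n = h.shape[0]
    # Average distance (1 - similarity) over the n*(n-1) off-diagonal pairs.
    off_diag_sim = sim.sum() - np.trace(sim)
    return float(1.0 - off_diag_sim / (n * (n - 1)))
```

For example, mutually orthogonal vectors give a dispersion of 1.0, while identical vectors give 0.0.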
📝 Abstract
We show that a language model's ability to predict text is tightly linked to the breadth of its embedding space: models that spread their contextual representations more widely tend to achieve lower perplexity. Concretely, we find that representation dispersion - the average pairwise cosine distance among hidden vectors - strongly and negatively correlates with perplexity across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia, news, scientific abstracts). Beyond illustrating this link, we show how dispersion can be leveraged for a range of practical tasks without requiring labeled data. First, measuring dispersion on unlabeled text allows us to predict downstream accuracy in new domains, offering a data-efficient tool for model selection. Next, we find that identifying layers with higher dispersion pinpoints the best representations for retrieval-based methods such as kNN-LM, bypassing exhaustive layer-by-layer searches. Finally, we integrate a simple push-away objective into training, which increases dispersion in both single-domain and cross-domain scenarios and directly improves perplexity in each.
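The push-away objective mentioned at the end of the abstract can be sketched as a simple repulsion regularizer added to the language modeling loss. The exact objective and weighting used in the paper are not given here; the version below is a hypothetical illustration that penalizes mean pairwise cosine similarity, so minimizing it spreads representations apart and raises dispersion. The `weight` parameter is an assumed hyperparameter.

```python
import numpy as np

def repulsion_regularized_loss(lm_loss: float, hidden: np.ndarray,
                               weight: float = 0.1) -> float:
    """Total loss = LM loss + weight * mean pairwise cosine similarity.

    Penalizing similarity among hidden vectors "pushes them away" from
    one another; in a real training loop this term would be computed on
    differentiable hidden states within each batch.
    """
    h = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    sim = h @ h.T
    n = h.shape[0]
    repulsion = (sim.sum() - np.trace(sim)) / (n * (n - 1))
    return float(lm_loss + weight * repulsion)
```

With mutually orthogonal hidden states the repulsion term vanishes and the total loss reduces to the LM loss alone.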