Learning is Forgetting: LLM Training As Lossy Compression

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the ambiguity in the representation-space structure of large language models, which hinders understanding of their learning mechanisms and their connections to human cognition. Viewing model training as a lossy compression process, we propose a unified information-theoretic framework grounded in the Information Bottleneck principle to characterize how models trade off retaining task-relevant information against compressing away the rest, and how closely they approach the optimal compression frontier. Through representational analysis, cross-model compression evaluation, and comparisons across large-scale training datasets, we empirically validate on multiple open-source large language models that distinct training strategies yield significantly different compression behaviors. Moreover, we demonstrate that proximity to compression optimality effectively predicts model performance across a broad range of downstream tasks, revealing that the structure of internal representations is a reliable indicator of model capability.
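
For reference, the summary's framing rests on the standard Information Bottleneck objective; the textbook (Tishby-style) form is reproduced below. This is the generic formulation, not necessarily the exact variant the paper analyzes.

```latex
% Standard Information Bottleneck objective (generic Tishby-style form),
% shown for reference only; the paper's precise variant may differ.
% T is the learned representation of input X, Y is the prediction target,
% and beta trades compression I(X;T) against retained task information I(T;Y).
\[
  \min_{p(t \mid x)} \; \mathcal{L}_{\mathrm{IB}}
  \;=\; I(X;T) \;-\; \beta \, I(T;Y)
\]
```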
📝 Abstract
Despite the increasing prevalence of large language models (LLMs), we still have a limited understanding of how their representational spaces are structured. This limits our ability to interpret how and what they learn, or to relate them to learning in humans. We argue that LLMs are best seen as an instance of lossy compression: over training, they learn by retaining only the information in their training data that is relevant to their objective(s). We show that pre-training results in models that are optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound on compression. Across an array of open-weights models, each compresses differently, likely due to differences in the data and training recipes used. However, even across different families of LLMs, the optimality of a model's compression, and the information present in it, can predict downstream performance across a wide array of benchmarks, letting us directly link representational structure to actionable insights about model performance. More generally, the work presented here offers a unified information-theoretic framing for how these models learn that is deployable at scale.
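
As a rough illustration of the kind of measurement such a framing implies, the sketch below estimates the two Information Bottleneck terms, I(X;T) and I(T;Y), from discretized representations using a naive binned estimator on synthetic data. The function names, binning scheme, and data are assumptions made purely for illustration; this is not the paper's actual estimation procedure, and binned estimators of this sort are known to be biased at realistic dimensionalities.

```python
# Illustrative sketch only: a simple binned mutual-information estimate over
# hidden representations, in the spirit of information-plane analyses.
# The estimator, binning scheme, and synthetic data are assumptions for
# illustration, not the measurement procedure used in the paper.
import numpy as np


def discretize(values: np.ndarray, n_bins: int = 30) -> np.ndarray:
    """Map each column of a real-valued matrix to integer bin indices."""
    lo, hi = values.min(axis=0), values.max(axis=0)
    width = np.where(hi > lo, hi - lo, 1.0)  # avoid zero-width ranges
    return np.floor((values - lo) / width * (n_bins - 1)).astype(int)


def joint_entropy(labels: np.ndarray) -> float:
    """Entropy (in bits) of the empirical distribution over rows of `labels`."""
    _, counts = np.unique(labels, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def mutual_information(a: np.ndarray, b: np.ndarray) -> float:
    """I(A;B) = H(A) + H(B) - H(A,B), from discretized samples (rows = samples)."""
    a = a.reshape(len(a), -1)
    b = b.reshape(len(b), -1)
    return joint_entropy(a) + joint_entropy(b) - joint_entropy(np.hstack([a, b]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for real quantities: x_ids would be input token ids, `hidden`
    # a model's layer activations, and y_ids the prediction targets.
    x_ids = rng.integers(0, 50, size=(2000, 1))
    hidden = np.tanh(x_ids @ rng.normal(size=(1, 8)) + 0.3 * rng.normal(size=(2000, 8)))
    y_ids = x_ids % 5

    t_bins = discretize(hidden, n_bins=10)
    i_xt = mutual_information(x_ids, t_bins)   # compression term I(X;T)
    i_ty = mutual_information(t_bins, y_ids)   # relevance term I(T;Y)
    print(f"I(X;T) ~ {i_xt:.2f} bits, I(T;Y) ~ {i_ty:.2f} bits")
```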
Problem

Research questions and friction points this paper is trying to address.

large language models
representational structure
lossy compression
Information Bottleneck
model interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

lossy compression
Information Bottleneck
representation learning
large language models
model interpretability