🤖 AI Summary
This work investigates the unclear mechanisms behind the strong generalization of large language models (LLMs), which occurs even when their parameter counts vastly exceed the volume of training data. We propose the Sparse Semantic Dimension (SSD), a verifiable measure of generalization derived from activation-based feature dictionaries extracted via sparse autoencoders (SAEs). Our analysis reveals that LLM generalization stems from low-dimensional, sparse structures in internal representations rather than sheer parameter scale. We identify a “feature sharpness” scaling law and leverage the phenomenon of “feature explosion” to enable effective out-of-distribution detection. Experiments on GPT-2 Small and Gemma-2B demonstrate that larger models exhibit more compressible semantic structures, and that SSD provides reliable uncertainty quantification along with non-vacuous generalization guarantees.
📝 Abstract
Standard statistical learning theory predicts that Large Language Models (LLMs) should overfit because their parameter counts vastly exceed the number of training tokens. Yet, in practice, they generalize robustly. We propose that the effective capacity controlling generalization lies in the geometry of the model's internal representations: while the parameter space is high-dimensional, the activation states lie on a low-dimensional, sparse manifold. To formalize this, we introduce the Sparse Semantic Dimension (SSD), a complexity measure derived from the active feature vocabulary of a Sparse Autoencoder (SAE) trained on the model's layers. Treating the LLM and SAE as frozen oracles, we use this framework to attribute the model's generalization capabilities to the sparsity of the dictionary rather than the total parameter count. Empirically, we validate this framework on GPT-2 Small and Gemma-2B, demonstrating that our bound provides non-vacuous certificates at realistic sample sizes. Crucially, we uncover a counter-intuitive "feature sharpness" scaling law: despite being an order of magnitude larger, Gemma-2B requires significantly fewer calibration samples to identify its active manifold compared to GPT-2, suggesting that larger models learn more compressible, distinct semantic structures. Finally, we show that this framework functions as a reliable safety monitor: out-of-distribution inputs trigger a measurable "feature explosion" (a sharp spike in active features), effectively signaling epistemic uncertainty through learned feature violation. Code is available at: https://github.com/newcodevelop/sparse-semantic-dimension.
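The "feature explosion" monitor described above can be sketched in a few lines: encode an activation with the SAE's encoder, count active features (the L0 norm), and flag inputs whose count exceeds a threshold calibrated on in-distribution data. The sketch below is purely illustrative, not the paper's implementation: a random ReLU encoder stands in for a trained SAE, and a large-norm input serves as a crude proxy for an out-of-distribution activation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 64, 512  # activation width and dictionary size (assumed values)

# Stand-in for a frozen SAE encoder: random weights plus a negative bias,
# so only pre-activations above the bias magnitude fire (promotes sparsity).
W_enc = rng.normal(size=(d_model, d_dict)) / np.sqrt(d_model)
b_enc = -1.0 * np.ones(d_dict)

def active_feature_count(h: np.ndarray) -> int:
    """L0 of the SAE feature vector: how many dictionary features fire on h."""
    f = np.maximum(h @ W_enc + b_enc, 0.0)  # ReLU encoder
    return int((f > 0).sum())

# Calibration: record feature counts on held-out in-distribution activations
# and take the maximum as a conservative threshold.
calib = rng.normal(size=(200, d_model))
threshold = max(active_feature_count(h) for h in calib)

def is_ood(h: np.ndarray) -> bool:
    """Flag a 'feature explosion': count exceeds the calibrated threshold."""
    return active_feature_count(h) > threshold

in_dist = calib[0]                          # a calibration sample, never flagged
out_dist = 10.0 * rng.normal(size=d_model)  # high-norm proxy for an OOD input
```

In this toy setup the high-norm input drives far more pre-activations past the bias, so its feature count spikes well above the calibrated maximum; in the paper's setting the threshold would instead be calibrated on real LLM activations with a trained SAE.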