🤖 AI Summary
This work addresses the instability and representational collapse observed in ultra-low-precision training of large language models (LLMs), which arise from a fundamental conflict between discrete quantization and the heavy-tailed spectral structure inherent in linguistic data. For the first time, we establish a theoretical link between Zipf’s law and random matrix theory, demonstrating that the power-law decay of the embedding spectrum is essential for semantic encoding. We show that uniform quantization injects a noise floor that truncates this spectral tail, flattening the spectrum and increasing the stable rank of representations. Through singular value spectrum analysis, quantization noise modeling, and stable rank theory, we empirically validate the causal relationship between quantization-induced spectral distortion and representational collapse in models such as GPT-2 and TinyLlama. We propose “spectral fidelity” as a necessary condition for stable low-bit training, offering a theoretical foundation for efficient ultra-low-precision LLM optimization.
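The mechanism is easy to reproduce on a toy example. The sketch below is not the paper's code: it builds a synthetic matrix with a Zipf-like power-law singular spectrum (the sizes `n`, `d` and exponent `alpha` are illustrative choices, not values from the paper), applies a standard uniform quantizer at decreasing bit-widths, and measures the spectral tail and stable rank.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an LLM embedding matrix (the paper studies GPT-2 and
# TinyLlama): orthogonal factors around a Zipf-like power-law spectrum.
# n, d, and alpha are illustrative assumptions, not values from the paper.
n, d, alpha = 512, 256, 1.0
sigma = np.arange(1, d + 1, dtype=float) ** (-alpha)   # sigma_k ~ k^(-alpha)
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
W = U @ np.diag(sigma) @ V.T

def quantize_uniform(x, bits):
    """Round each entry to the nearest of 2**bits uniformly spaced levels."""
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((x - lo) / step) * step

def stable_rank(a):
    """srank(A) = ||A||_F^2 / sigma_max(A)^2 (standard definition)."""
    sv = np.linalg.svd(a, compute_uv=False)
    return float((sv ** 2).sum() / sv[0] ** 2)

print(f"float  srank={stable_rank(W):6.2f}   tail sigma={sigma[-1]:.2e}")
for bits in (8, 4, 2):
    Wq = quantize_uniform(W, bits)
    sv = np.linalg.svd(Wq, compute_uv=False)
    # Quantization error acts like a noise floor of roughly step/sqrt(12)
    # per entry; tail singular values are lifted toward it, the power-law
    # decay flattens, and the stable rank rises.
    print(f"{bits}-bit  srank={stable_rank(Wq):6.2f}   tail sigma={sv[-1]:.2e}")
```

On this synthetic matrix the printed stable rank grows and the smallest singular values rise as the bit-width shrinks, qualitatively mirroring the quantization-induced spectral flattening the summary describes.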
📝 Abstract
Training Large Language Models (LLMs) at ultra-low precision is critically impeded by instability rooted in the conflict between discrete quantization constraints and the intrinsically heavy-tailed spectral structure of linguistic data. By formalizing the connection between Zipfian statistics and random matrix theory, we prove that power-law decay in the singular value spectra of embeddings is a fundamental prerequisite for semantic encoding. We derive theoretical bounds showing that uniform quantization introduces a noise floor that disproportionately truncates this spectral tail, inducing spectral flattening and a provably strict increase in the stable rank of representations. Empirical validation across diverse architectures, including GPT-2 and TinyLlama, corroborates that this geometric degradation precipitates representational collapse. This work not only quantifies the spectral sensitivity of LLMs but also establishes spectral fidelity as a necessary condition for stable low-bit optimization.
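For concreteness, the central quantities in the abstract can be written out as follows. These are standard definitions; the symbols (W for the embedding matrix, b for the bit-width, Δ for the quantization step, E for the noise matrix, α for the decay exponent) are our own shorthand, not notation taken from the paper.

```latex
% Zipf-like power-law decay of the embedding singular spectrum:
\sigma_k \propto k^{-\alpha}, \qquad \alpha > 0.

% Stable rank of an n x d matrix W (standard definition):
\operatorname{srank}(W) = \frac{\|W\|_F^2}{\|W\|_2^2}
                        = \frac{\sum_k \sigma_k^2}{\sigma_1^2}.

% Uniform b-bit quantization: step size and resulting noise floor,
% modeling each entry of the error E as uniform on [-\Delta/2, \Delta/2]:
\Delta = \frac{\max(W) - \min(W)}{2^b - 1}, \qquad
\mathbb{E}\,\|E\|_F^2 = \frac{n d\, \Delta^2}{12}.
```

Under this standard noise model, singular values of W sitting below the scale set by Δ are effectively indistinguishable from noise, which is the sense in which uniform quantization "truncates" the power-law tail.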