🤖 AI Summary
Existing ultra-low-bit (<1-bit) compression methods for large language models are hindered by a geometric mismatch between latent representations and the binary hypercube, preventing them from approaching theoretical performance limits. This work identifies high coherence among latent variables as the key cause of this mismatch and introduces a Joint Iterative Quantization (Joint-ITQ) framework. By applying an internal latent rotation to geometrically pre-align representations prior to low-rank binarization, Joint-ITQ effectively unlocks spectral energy gains. The method incurs no additional inference overhead and establishes a new state of the art for sub-1-bit compression at 0.1–1 bits per parameter on Llama-2 and Llama-3, matching the performance of current best 1-bit approaches.
📝 Abstract
We identify the Spectral Energy Gain in extreme model compression: for heavy-tailed spectra, low-rank binary approximations outperform tiny-rank floating-point baselines. However, prior attempts fail to realize this potential and trail state-of-the-art 1-bit methods. We attribute this degradation to Latent Geometry Misalignment: standard singular vectors exhibit high coherence (a spiky distribution), which is the worst-case geometry for binary quantization. To realize this gain, we propose LittleBit-2, a framework employing Internal Latent Rotation and Joint Iterative Quantization (Joint-ITQ). This approach acts as a geometric preconditioner, aligning coherent latent distributions with the binary hypercube at zero inference overhead. Empirically, LittleBit-2 establishes a new state of the art in the sub-1-bit regime (0.1 to 1 bits per parameter, bpp) on Llama-2 and Llama-3, matching the fidelity of leading 1-bit baselines.
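To make the geometric intuition concrete, the following is a minimal NumPy sketch of the classic iterative quantization (ITQ) alternation applied to a pair of low-rank factors. It is not the LittleBit-2 implementation. It assumes that "joint" means fitting one shared orthogonal rotation `R` on both factors stacked together, and the function names (`itq_rotation`, `binarization_error`, `joint_rotate`) and the toy factors are hypothetical. Because `R` is orthogonal, `(A @ R) @ (B @ R).T` equals `A @ B.T`, which is one way a rotation inside the latent space can be absorbed with no extra cost at inference.

```python
import numpy as np

def itq_rotation(Z, n_iter=50, seed=0):
    """Find an orthogonal R that moves the rows of Z @ R toward hypercube vertices."""
    rng = np.random.default_rng(seed)
    r = Z.shape[1]
    R, _ = np.linalg.qr(rng.standard_normal((r, r)))  # random orthogonal start
    for _ in range(n_iter):
        # Fix R: the nearest binary code is the elementwise sign.
        B = np.where(Z @ R >= 0, 1.0, -1.0)
        # Fix B: orthogonal Procrustes, solved by an SVD of Z^T B.
        U, _, Vt = np.linalg.svd(Z.T @ B)
        R = U @ Vt
    return R

def binarization_error(Z):
    """Relative squared error of the scaled sign approximation alpha * sign(Z)."""
    alpha = np.abs(Z).mean()  # optimal single scale for a sign matrix
    B = np.where(Z >= 0, 1.0, -1.0)
    return np.linalg.norm(Z - alpha * B) ** 2 / np.linalg.norm(Z) ** 2

def joint_rotate(A, B, **kwargs):
    """Assumed joint variant: one shared rotation fitted on both stacked factors."""
    R = itq_rotation(np.vstack([A, B]), **kwargs)
    return A @ R, B @ R, R

# Toy, maximally coherent (spiky) factors: one nonzero entry per column.
A = np.eye(16)[:, :8]
B = np.eye(12)[:, :8]
A_rot, B_rot, R = joint_rotate(A, B)

print("error before rotation:", binarization_error(A))
print("error after rotation: ", binarization_error(A_rot))
```

In this toy case the unrotated factors concentrate all of their energy in a few entries, so a sign matrix with a single scale fits them poorly. The rotation spreads that energy across coordinates and lowers the binarization error while leaving the product of the two factors unchanged.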