🤖 AI Summary
Efficient low-bit quantization of pretrained real-valued large language models remains challenging due to substantial accuracy degradation and the need for costly retraining.
Method: This paper proposes the first training-free, general-purpose complex-number quantization framework. It proves a lossless equivalence between real-valued linear layers and widely-linear complex maps, converting pretrained Transformer layers into the complex domain, and pairs this conversion with phase-aware quantization. The framework introduces a codebook of the fourth roots of unity, multiplication-free (MAC-free) inference, and a recursive residual quantization strategy.
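The widely-linear equivalence can be illustrated on the smallest case: a 2×2 real matrix acting on a pair of reals is reproduced exactly by y = A·z + B·conj(z) with z = x1 + i·x2. A minimal sketch, under the assumption that the layer-wise construction reduces to this scalar form (the function name is illustrative, not from the paper):

```python
def real_to_widely_linear(W):
    """Convert a 2x2 real block W = [[w11, w12], [w21, w22]] acting on
    (x1, x2) into the equivalent widely-linear complex map
        y = A*z + B*conj(z),   z = x1 + i*x2.
    Scalar case of the (assumed) layer-wise construction."""
    w11, w12 = W[0]
    w21, w22 = W[1]
    # Standard widely-linear decomposition of a real 2x2 map.
    A = complex(w11 + w22, w21 - w12) / 2
    B = complex(w11 - w22, w21 + w12) / 2
    return A, B
```

Checking A·z + B·conj(z) against the original real matrix-vector product confirms the conversion is lossless, which is what lets pretrained checkpoints be reused without retraining.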
Contribution/Results: It achieves plug-and-play conversion of pretrained real-valued models (e.g., LLaMA-2 7B) into ultra-low-bit complex-domain models at an effective 2-bit precision, without any fine-tuning. The resulting models retain over 98% of full-precision performance on standard benchmarks, significantly outperforming state-of-the-art real-valued binary and ternary methods. Moreover, the design ensures hardware efficiency and seamless compatibility with existing model-serving ecosystems.
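The recursive residual quantization can be sketched as a greedy loop: quantize each weight's phase against the codebook {1, i, −1, −i}, subtract the reconstruction, and quantize the residual again. The per-stage mean-magnitude scale below is a simplifying assumption for illustration, not the paper's fitting procedure:

```python
CODEBOOK = [1, 1j, -1, -1j]  # fourth roots of unity

def quantize_phase(r: complex, scale: float) -> complex:
    """Phase-aware rounding: pick the codeword c minimizing |r - scale*c|."""
    return min(CODEBOOK, key=lambda c: abs(r - scale * c))

def residual_quantize(w, stages=2):
    """Greedy recursive residual quantization (illustrative sketch).
    Each stage quantizes the remaining residual with a shared scale."""
    levels = []
    residual = list(w)
    for _ in range(stages):
        scale = sum(abs(r) for r in residual) / len(residual)  # assumed scale rule
        codes = [quantize_phase(r, scale) for r in residual]
        levels.append((scale, codes))
        residual = [r - scale * c for r, c in zip(residual, codes)]
    return levels, residual

def dequantize(levels, n):
    """Reconstruct weights as the sum of scaled codewords across stages."""
    out = [0j] * n
    for scale, codes in levels:
        out = [o + scale * c for o, c in zip(out, codes)]
    return out
```

Each added stage stores two more bits per weight (one codeword index) and strictly reduces the reconstruction error, which is how the scheme reaches an effective 2-bit budget while staying training-free.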
📝 Abstract
Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, are better suited to low-bit representation than their real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models. Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity. Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation. We demonstrate that Fairy2i restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods. This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving the way for efficient inference on commodity hardware.
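The multiplication-free accumulation follows from the codebook itself: multiplying an activation by a fourth root of unity is only a sign flip and/or a real/imaginary swap, so a quantized dot product needs additions alone. A rough sketch, assuming codewords are stored as exponents k with c = i^k (the function names are illustrative):

```python
def mul_by_code(a: complex, k: int) -> complex:
    """Multiply activation a by i**k (k in 0..3) using only sign flips
    and real/imaginary swaps -- no real multiplications."""
    re, im = a.real, a.imag
    if k == 0:
        return complex(re, im)        # a * 1
    if k == 1:
        return complex(-im, re)       # a * i
    if k == 2:
        return complex(-re, -im)      # a * -1
    return complex(im, -re)           # a * -i

def macfree_dot(codes, acts):
    """Accumulate sum_j (i**k_j) * a_j with additions only (MAC-free)."""
    acc = 0j
    for k, a in zip(codes, acts):
        acc += mul_by_code(a, k)
    return acc
```

On hardware this removes the multiplier from the inner loop entirely, which is the source of the claimed efficiency on commodity devices.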