🤖 AI Summary
To address memory-bandwidth bottlenecks in large language model (LLM) inference, this paper studies ternary language models trained with quantization-aware training, introducing 2-bit and 1.6-bit packing schemes for efficient ternary weight representation. Theoretical analysis and empirical evaluation reveal that ternary models exhibit distinct scaling behavior: performance depends more critically on training data volume than on parameter count. Leveraging this insight, we release Spectra-1.1, a family of high-performance ternary LLMs trained on 1.2 trillion tokens, and, building on the 2-bit packing scheme, design TriRun, a custom GPU inference kernel for ternary weights. Evaluated across hardware platforms, Spectra-1.1 achieves substantial speedups: significant throughput improvement on CPU and up to 5× end-to-end acceleration on GPU, with minimal accuracy degradation. This work establishes a scalable, high-throughput paradigm for ultra-low-bit LLM inference.
📝 Abstract
Large language models (LLMs) are increasingly used across research and industry applications, yet their inference efficiency remains a significant challenge. While the computational power of modern GPU architectures has continued to improve, memory bandwidth and capacity have not scaled proportionally, creating a critical bottleneck during inference. To address this, we investigate ternary language models (TriLMs), which employ quantization-aware training to significantly reduce memory requirements. We first analyze the scalability of TriLMs by conducting a scaling law analysis, revealing that TriLMs benefit more from increasing training data than from scaling model parameters. Based on this observation, we introduce Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights, which enable accelerated inference across various CPU architectures. Building on the 2-bit packing scheme, we further develop a GPU kernel called TriRun that accelerates end-to-end model inference by up to 5 times compared to floating-point baselines. To encourage further exploration and development of TriLMs, we will release the Spectra-1.1 suite and TriRun inference kernels. Overall, our work lays the foundation for building and deploying efficient LLMs, providing a valuable resource for the research community.
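The 1.6-bit figure comes from a counting argument: a ternary weight takes one of three values, and since 3^5 = 243 ≤ 256, five ternary weights fit in a single byte, i.e. 8/5 = 1.6 bits per weight (versus 2 bits when each weight gets its own two-bit field). The sketch below illustrates this base-3 idea only; the function names and byte layout are our assumptions, not the paper's actual packing format.

```python
def pack5(weights):
    """Pack five ternary weights (-1, 0, +1) into one byte via base-3 encoding.

    Note: illustrative only -- the real Spectra-1.1/TriRun layout may differ.
    """
    assert len(weights) == 5 and all(w in (-1, 0, 1) for w in weights)
    code = 0
    for w in reversed(weights):
        code = code * 3 + (w + 1)  # map -1/0/+1 to trits 0/1/2
    return code  # an integer in 0..242, so it fits in one byte

def unpack5(code):
    """Recover the five ternary weights from a packed byte."""
    weights = []
    for _ in range(5):
        weights.append(code % 3 - 1)  # trit back to -1/0/+1
        code //= 3
    return weights

# Round trip: pack and unpack a group of five weights
w = [1, -1, 0, 1, 0]
assert unpack5(pack5(w)) == w
```

By contrast, the simpler 2-bit scheme stores each weight in its own two-bit field, which wastes one of the four code points but allows unpacking with cheap shifts and masks rather than divisions, which is why the GPU kernel builds on the 2-bit layout.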