Spectra 1.1: Scaling Laws and Efficient Inference for Ternary Language Models

📅 2025-06-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address memory-bandwidth bottlenecks in large language model (LLM) inference, this paper investigates ternary language models trained with quantization-aware training, introducing 2-bit and 1.6-bit weight-packing schemes for efficient ternary weight representation. Scaling-law analysis reveals that ternary models exhibit distinct scaling behavior: performance depends more critically on training-data volume than on parameter count. Leveraging these insights, we design TriRun, a custom GPU inference kernel optimized for ternary operations, and release Spectra-1.1, a family of high-performance ternary LLMs trained on up to 1.2 trillion tokens. Evaluated across hardware platforms, Spectra-1.1 achieves substantial speedups, with accelerated inference across CPU architectures and up to 5× end-to-end acceleration on GPU, at minimal accuracy degradation. This work establishes a scalable, high-throughput paradigm for ultra-low-bit LLM inference.

📝 Abstract
Large language models (LLMs) are increasingly used across research and industry applications, yet their inference efficiency remains a significant challenge. As the computational power of modern GPU architectures continuously improves, their memory bandwidth and capacity have not scaled proportionally, creating a critical bottleneck during inference. To address this, we investigate ternary language models (TriLMs) that employ quantization-aware training to significantly reduce memory requirements. We first analyze the scalability of TriLMs by conducting a scaling law analysis, revealing that TriLMs benefit more from increasing training data than from scaling model parameters. Based on this observation, we introduce Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights, which demonstrate accelerated inference across various CPU architectures. Also, building on the 2-bit packing, we develop a GPU kernel called TriRun that accelerates end-to-end model inference by up to 5 times compared to floating-point baselines. To encourage further exploration and development of TriLMs, we will release the Spectra-1.1 suite and TriRun inference kernels. Overall, our work lays the foundation for building and deploying efficient LLMs, providing a valuable resource for the research community.
Problem

Research questions and friction points this paper is trying to address.

Addressing memory bottlenecks in large language model inference
Exploring ternary language models for efficient memory usage
Developing accelerated inference techniques for ternary models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ternary language models reduce memory requirements
2-bit and 1.6-bit packing accelerate inference
TriRun GPU kernel speeds up end-to-end inference by up to 5×
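The packing idea behind the 2-bit and 1.6-bit schemes can be sketched as below. This is an illustrative assumption, not the paper's actual TriRun bit layout: the 2-bit variant stores four trits per byte with one 2-bit code each, while the 1.6-bit variant stores five trits per byte in base-3 (3⁵ = 243 ≤ 256, i.e. 8/5 = 1.6 bits per weight).

```python
def pack_2bit(trits):
    """Pack ternary weights in {-1, 0, +1} at 2 bits each, 4 per byte.
    Code assignment (-1 -> 0, 0 -> 1, +1 -> 2) is an illustrative choice."""
    out = bytearray()
    for i in range(0, len(trits), 4):
        b = 0
        for j, t in enumerate(trits[i:i + 4]):
            b |= (t + 1) << (2 * j)
        out.append(b)
    return bytes(out)

def pack_1p6bit(trits):
    """Pack 5 trits per byte as a base-3 integer: 8/5 = 1.6 bits per weight."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        b = 0
        for t in reversed(trits[i:i + 5]):  # first trit ends up least significant
            b = b * 3 + (t + 1)
        out.append(b)
    return bytes(out)

def unpack_1p6bit(data, n):
    """Recover the first n trits by repeated division by 3."""
    trits = []
    for b in data:
        for _ in range(5):
            trits.append(b % 3 - 1)
            b //= 3
    return trits[:n]
```

The base-3 scheme trades a few integer divisions at unpack time for 20% less memory traffic than the 2-bit layout, which matches the paper's framing of inference as memory-bandwidth bound.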
Tejas Vaidhya
CEO of Nolano
Machine Learning · Natural Language Processing · Computer Vision
Ayush Kaushal
Nolano AI, Mila - Quebec AI Institute, Université de Montréal
Vineet Jain
McGill University, Mila
Francis Couture Harpin
École de technologie supérieure, Université du Québec
Prashant Shishodia
Google, India
Majid Behbahani
Morgan Stanley
Yuriy Nevmyvaka
Morgan Stanley
Irina Rish
University of Montreal / Mila - Quebec AI Institute
Artificial Intelligence · Machine Learning · Neuroscience