🤖 AI Summary
To address memory-bandwidth bottlenecks in large language model (LLM) inference, this paper studies ternary language models trained with quantization-aware training, introducing 2-bit and 1.6-bit packing schemes for efficient ternary weight representation. Theoretical analysis and empirical evaluation reveal that ternary models exhibit distinct scaling behavior: performance depends more critically on training data volume than on parameter count. Leveraging this insight, we release Spectra-1.1, a family of high-performance ternary LLMs trained on 1.2 trillion tokens, and, building on the 2-bit packing scheme, design TriRun, a custom GPU inference kernel for ternary weights. Evaluated across hardware platforms, Spectra-1.1 achieves substantial speedups: significant throughput improvement on CPU and up to 5× end-to-end acceleration on GPU, with minimal accuracy degradation. This work establishes a scalable, high-throughput paradigm for ultra-low-bit LLM inference.
📝 Abstract
Large language models (LLMs) are increasingly used across research and industry applications, yet their inference efficiency remains a significant challenge. While the computational power of modern GPU architectures has continued to improve, memory bandwidth and capacity have not scaled proportionally, creating a critical bottleneck during inference. To address this, we investigate ternary language models (TriLMs), which employ quantization-aware training to significantly reduce memory requirements. We first analyze the scalability of TriLMs by conducting a scaling law analysis, revealing that TriLMs benefit more from increasing training data than from scaling model parameters. Based on this observation, we introduce Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights, which enable accelerated inference across various CPU architectures. Building on the 2-bit packing scheme, we further develop a GPU kernel called TriRun that accelerates end-to-end model inference by up to 5 times compared to floating-point baselines. To encourage further exploration and development of TriLMs, we will release the Spectra-1.1 suite and TriRun inference kernels. Overall, our work lays the foundation for building and deploying efficient LLMs, providing a valuable resource for the research community.
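The 1.6-bit figure comes from a counting argument: a ternary weight takes one of three values, and since 3^5 = 243 ≤ 256, five ternary weights fit in a single byte, i.e. 8/5 = 1.6 bits per weight (versus 2 bits when each weight gets its own two-bit field). The sketch below illustrates this base-3 idea only; the function names and byte layout are our assumptions, not the paper's actual packing format.

```python
def pack5(weights):
    """Pack five ternary weights (-1, 0, +1) into one byte via base-3 encoding.

    Note: illustrative only -- the real Spectra-1.1/TriRun layout may differ.
    """
    assert len(weights) == 5 and all(w in (-1, 0, 1) for w in weights)
    code = 0
    for w in reversed(weights):
        code = code * 3 + (w + 1)  # map -1/0/+1 to trits 0/1/2
    return code  # an integer in 0..242, so it fits in one byte

def unpack5(code):
    """Recover the five ternary weights from a packed byte."""
    weights = []
    for _ in range(5):
        weights.append(code % 3 - 1)  # trit back to -1/0/+1
        code //= 3
    return weights

# Round trip: pack and unpack a group of five weights
w = [1, -1, 0, 1, 0]
assert unpack5(pack5(w)) == w
```

By contrast, the simpler 2-bit scheme stores each weight in its own two-bit field, which wastes one of the four code points but allows unpacking with cheap shifts and masks rather than divisions, which is why the GPU kernel builds on the 2-bit layout.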