TernaryLM: Memory-Efficient Language Modeling via Native 1-Bit Quantization with Adaptive Layer-wise Scaling

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of deploying large language models on resource-constrained devices, where computational and memory demands are prohibitive. The authors propose TernaryLM, claimed to be the first natively trained 1-bit ternary Transformer (weights restricted to {-1, 0, +1}), built from scratch at 132 million parameters. The approach combines a layer-wise adaptive scaling mechanism with a straight-through estimator (STE) to keep training stable. Experiments show that TernaryLM reaches a validation perplexity of 58.42 on TinyStories and an F1 score of 82.47% on MRPC, while cutting memory usage by 2.4× (498 MB vs. 1197 MB) at comparable inference latency. The study further finds that intermediate Transformer layers tolerate extreme quantization best, offering guidance for non-uniform precision design.
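The core recipe described above (ternary weights plus a learned per-layer scale) can be illustrated with a minimal sketch. The paper does not publish its exact thresholding rule in this summary, so the `0.7 × mean|w|` sparsity threshold below follows common ternary-weight practice and is an assumption, as is the choice of the mean absolute value of the surviving weights as the layer scale.

```python
# Hypothetical sketch of ternary quantization with an adaptive per-layer
# scale. The 0.7 * mean|w| threshold and the scale definition are
# assumptions, not TernaryLM's published rule.

def ternarize(weights):
    """Map a flat list of float weights to {-1, 0, +1} plus a scale alpha."""
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = 0.7 * mean_abs  # sparsity threshold (assumed heuristic)
    q = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    # Adaptive layer scale: mean |w| over the weights that stayed nonzero.
    nonzero = [abs(w) for w, t in zip(weights, q) if t != 0]
    alpha = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return q, alpha

# The dequantized layer weight is then approximated as alpha * q.
q, alpha = ternarize([0.9, -0.05, 0.4, -1.2, 0.02])
```

At inference time only the ternary codes and one scale per layer need to be stored, which is where the reported 2.4× memory reduction comes from.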

📝 Abstract
Large language models (LLMs) achieve remarkable performance but demand substantial computational resources, limiting deployment on edge devices and in resource-constrained environments. We present TernaryLM, a 132M-parameter Transformer architecture that employs native 1-bit ternary quantization ({-1, 0, +1}) during training, achieving significant memory reduction without sacrificing language modeling capability. Unlike post-training quantization approaches, which quantize pre-trained full-precision models, TernaryLM learns quantization-aware representations from scratch using straight-through estimators and adaptive per-layer scaling factors. Our experiments demonstrate: (1) validation perplexity of 58.42 on TinyStories; (2) downstream transfer with 82.47% F1 on MRPC paraphrase detection; (3) 2.4× memory reduction (498 MB vs. 1197 MB) with comparable inference latency; and (4) stable training dynamics across diverse corpora. We provide a layer-wise quantization analysis showing that middle Transformer layers exhibit the highest compatibility with extreme quantization, informing future non-uniform precision strategies. Our results suggest that native 1-bit training is a promising direction for efficient neural language models. Code is available at https://github.com/1nisharg/TernaryLM-Memory-Efficient-Language-Modeling.
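The straight-through estimator mentioned in the abstract can be sketched in a toy scalar setting: the forward pass uses the quantized weight, while the backward pass pretends the quantizer is the identity, so the full-precision "shadow" weight still receives a gradient. The thresholds, learning rate, and squared-error objective below are illustrative assumptions, not the authors' training setup.

```python
# Minimal straight-through-estimator sketch (assumed setup, not the
# authors' code): a single scalar weight trained with squared error.

def quantize(w):
    # Ternary quantizer with assumed +/-0.5 thresholds.
    return -1.0 if w < -0.5 else (1.0 if w > 0.5 else 0.0)

def train_step(w, x, target, lr=0.1):
    q = quantize(w)
    y = q * x                    # forward pass uses the ternary weight
    grad_y = 2.0 * (y - target)  # gradient of squared error (y - target)^2
    grad_w = grad_y * x          # STE: dq/dw is treated as 1 in backward
    return w - lr * grad_w       # update the full-precision shadow weight

w = 0.2  # starts below threshold, so it initially quantizes to 0
for _ in range(50):
    w = train_step(w, x=1.0, target=1.0)
```

Even though the quantizer's true gradient is zero almost everywhere, the STE lets the shadow weight drift across the threshold until its ternary code matches the target, which is what makes stable from-scratch 1-bit training possible.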
Problem

Research questions and friction points this paper is trying to address.

large language models
memory efficiency
quantization
edge deployment
resource-constrained environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

1-bit quantization
ternary neural networks
quantization-aware training
memory-efficient LLMs
adaptive layer-wise scaling