Continuous Autoregressive Language Models

📅 2025-10-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from low inference efficiency because of their token-by-token autoregressive generation. To address this, we propose Continuous Autoregressive Language Modeling (CALM), a framework that reformulates language modeling as continuous vector sequence prediction, bypassing discrete token forecasting entirely. CALM employs a high-fidelity autoencoder to compress K consecutive tokens into a single high-dimensional continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy, reducing the number of generative steps by a factor of K while preserving semantic fidelity. Because the continuous outputs admit no explicit likelihood, CALM introduces a likelihood-free toolkit for end-to-end training, evaluation, and controllable sampling in the continuous domain. Experiments demonstrate that CALM matches the performance of strong discrete baselines at substantially lower computational cost, yielding significant gains in compute efficiency measured as performance per FLOP.
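
To make the compression step concrete, the following PyTorch sketch compresses K consecutive tokens into one continuous vector and reconstructs them from it. The module and dimension names (ChunkAutoencoder, latent_dim, K=4) are hypothetical and are not taken from the paper's released code.

```python
# Illustrative sketch: compress K consecutive token embeddings into one
# continuous vector and reconstruct the tokens from it.
import torch
import torch.nn as nn

class ChunkAutoencoder(nn.Module):
    def __init__(self, vocab_size=32000, emb_dim=256, latent_dim=512, K=4):
        super().__init__()
        self.K = K
        self.vocab_size = vocab_size
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Encoder: flatten the K token embeddings and map them to one latent vector.
        self.encoder = nn.Sequential(
            nn.Linear(K * emb_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )
        # Decoder: map the latent vector back to K per-position vocabulary logits.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, K * vocab_size),
        )

    def forward(self, tokens):               # tokens: (batch, K) int64
        x = self.embed(tokens).flatten(1)    # (batch, K * emb_dim)
        z = self.encoder(x)                  # (batch, latent_dim) continuous vector
        logits = self.decoder(z).view(-1, self.K, self.vocab_size)
        return z, logits                     # train with cross-entropy on logits
```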

📝 Abstract
The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain. Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a significantly lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models. Code: https://github.com/shaochenze/calm. Project: https://shaochenze.github.io/blog/2025/CALM.
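
As a hedged illustration of next-vector prediction, the sketch below shows a generation loop in which an autoregressive backbone emits one continuous vector per step and the autoencoder's decoder maps each vector back to K tokens, so K tokens are produced per generative step. The `backbone` and `autoencoder` interfaces are assumed for illustration and are not drawn from the CALM repository.

```python
# Next-vector generation sketch: each step predicts one continuous vector,
# which is decoded into K tokens, cutting the number of generative steps by K.
import torch

@torch.no_grad()
def generate(backbone, autoencoder, prefix_vectors, num_steps):
    vectors = list(prefix_vectors)            # each entry: (latent_dim,) tensor
    tokens = []
    for _ in range(num_steps):
        context = torch.stack(vectors).unsqueeze(0)        # (1, T, latent_dim)
        next_vec = backbone(context)[:, -1, :]              # predicted next vector
        vectors.append(next_vec.squeeze(0))
        logits = autoencoder.decoder(next_vec)              # (1, K * vocab_size)
        chunk = logits.view(autoencoder.K, -1).argmax(-1)   # greedy token recovery
        tokens.extend(chunk.tolist())
    return tokens
```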
Problem

Research questions and friction points this paper is trying to address.

Overcoming token-by-token generation bottleneck in large language models
Increasing semantic bandwidth through continuous next-vector prediction
Improving performance-compute trade-off with compressed continuous representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous next-vector prediction replaces discrete token generation
High-fidelity autoencoder compresses multiple tokens into one vector
Likelihood-free framework enables robust training in the continuous domain (see the sketch after this list)
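
One way to realize a likelihood-free objective in the continuous domain is a strictly proper scoring rule such as the energy score, sketched below in PyTorch: the model draws two independent samples of the next vector and is penalized for distance to the target while rewarded for spread between its own samples. This is an illustrative assumption about the training signal; the paper's exact objective should be checked against the released code.

```python
# Energy-score style likelihood-free loss for a predicted continuous vector.
import torch

def energy_score_loss(sample_a, sample_b, target):
    # sample_a, sample_b: (batch, latent_dim) two independent draws from the model
    # target:             (batch, latent_dim) ground-truth next vector (from the autoencoder)
    dist_to_target = 0.5 * (torch.norm(sample_a - target, dim=-1)
                            + torch.norm(sample_b - target, dim=-1))
    spread = 0.5 * torch.norm(sample_a - sample_b, dim=-1)
    return (dist_to_target - spread).mean()
```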
Chenze Shao
Tencent
Machine Translation, Natural Language Processing, Deep Learning
Darren Li
WeChat AI, Tencent Inc; Qiuzhen College, Tsinghua University
Fandong Meng
WeChat AI, Tencent
Machine Translation, Natural Language Processing
Jie Zhou
WeChat AI, Tencent Inc