🤖 AI Summary
Current language models face bottlenecks in complex structural understanding, cross-lingual generalization, and zero-/few-shot in-context reasoning, while suffering from suboptimal parameter and computational efficiency. This paper introduces Latent-Thought Language Models (LTMs), the first framework to explicitly model latent thought vectors under a learned prior; these vectors guide token generation through a Transformer decoder and are trained via dual-rate optimization within the classical variational Bayes framework. Key contributions are: (1) identifying latent-vector dimensionality, not just parameter count, as a new axis for model scaling; (2) proposing an efficient scaling paradigm that trades model size for additional inference steps; and (3) demonstrating emergent few-shot in-context reasoning that grows with model and latent size. Experiments show that LTMs significantly outperform autoregressive and discrete diffusion baselines in validation perplexity and zero-shot language modeling, and achieve competitive few-shot reasoning and text generation with superior parameter and sample efficiency.
📝 Abstract
We propose a novel family of language models, Latent-Thought Language Models (LTMs), which incorporate explicit latent thought vectors governed by a prior model in latent space. These latent thought vectors guide the autoregressive generation of ground tokens through a Transformer decoder. Training employs a dual-rate optimization process within the classical variational Bayes framework: fast learning of local variational parameters for the posterior distribution of latent vectors, and slow learning of global decoder parameters. Empirical studies reveal that LTMs possess additional scaling dimensions beyond traditional LLMs, yielding a structured design space. Higher sample efficiency can be achieved by increasing training compute per token, with further gains possible by trading model size for more inference steps. Designed based on these scaling properties, LTMs demonstrate superior sample and parameter efficiency compared to conventional autoregressive models and discrete diffusion models. They significantly outperform these counterparts in validation perplexity and zero-shot language modeling. Additionally, LTMs exhibit emergent few-shot in-context reasoning capabilities that scale with model and latent size, and achieve competitive performance in conditional and unconditional text generation.
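The dual-rate optimization described in the abstract, fast updates to per-sequence variational parameters alternating with slow updates to shared decoder weights, can be sketched in a toy setting. The snippet below is a minimal NumPy illustration under strong simplifying assumptions: a linear decoder stands in for the paper's Transformer, and a MAP point estimate of each latent replaces the full variational posterior. All dimensions, learning rates, and loop counts are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

D_OBS, D_LAT, N = 8, 4, 32
FAST_LR, SLOW_LR = 0.05, 0.005   # dual rates: fast for local latents, slow for the decoder

# Synthetic data from a ground-truth linear decoder (illustrative setup only).
W_true = 0.5 * rng.normal(size=(D_OBS, D_LAT))
X = rng.normal(size=(N, D_LAT)) @ W_true.T + 0.1 * rng.normal(size=(N, D_OBS))

W = 0.1 * rng.normal(size=(D_OBS, D_LAT))  # global decoder parameters (slowly learned)

def neg_elbo(W, X, Z):
    # Reconstruction error plus a standard-Gaussian prior penalty on the latents:
    # a negative ELBO up to constants, under the MAP (point-mass) approximation.
    return ((X - Z @ W.T) ** 2).sum() + (Z ** 2).sum()

losses = []
for epoch in range(100):
    Z = np.zeros((N, D_LAT))                 # local latents, re-inferred each pass
    for _ in range(30):                      # fast inner loop: infer posterior modes
        grad_Z = -2.0 * (X - Z @ W.T) @ W + 2.0 * Z
        Z -= FAST_LR * grad_Z
    grad_W = -2.0 * (X - Z @ W.T).T @ Z      # slow outer step: one decoder update
    W -= SLOW_LR * grad_W
    losses.append(neg_elbo(W, X, Z))

print(f"negative ELBO: {losses[0]:.1f} -> {losses[-1]:.1f}")
```

The fast inner loop plays the role of inference-time computation: running more fast steps per batch improves the inferred latents without growing the decoder, which loosely mirrors the paper's trade of model size for additional inference steps.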