🤖 AI Summary
Existing flow language models (FLMs) operate on one-hot vectors whose dimension scales with the vocabulary size, which makes training costly. Because all distinct one-hot vectors are equidistant, the Gaussian noise these models inject also has no clear semantic interpretation. This work proposes S-FLM, a latent FLM on the hypersphere: sequences are generated by rotating vectors along a velocity field that is learned with cross-entropy and integrated as a deterministic ordinary differential equation (ODE), which avoids materializing one-hot vectors. S-FLM substantially improves on prior continuous flow language models in large-vocabulary reasoning domains such as math and code. Under standard-temperature sampling (T = 1) it closes the gap to masked diffusion models, while a gap remains under optimized low-temperature (T = 0.1) decoding.
📝 Abstract
Discrete Diffusion Language Models have progressed rapidly as an alternative to autoregressive (AR) models, motivated by their parallel generation abilities. However, for tractability, discrete diffusion models sample from a factorized distribution, which is less expressive than AR. Recent Flow Language Models (FLMs) apply continuous flows to language, transporting noise to data with a deterministic ODE that avoids factorized sampling. FLMs operate on one-hot vectors whose dimension scales with the vocabulary size, making them costly to train. Moreover, since all distinct one-hot embeddings are equidistant in $\ell_2$, adding Gaussian noise does not have a clear semantic interpretation (unlike in images, where Gaussian noise progressively degrades structure). We introduce $\mathbb{S}$-FLM, a latent FLM on the hypersphere. $\mathbb{S}$-FLM generates sequences by rotating vectors in $\mathbb{S}^{d-1}$ along a velocity field learned with cross-entropy, avoiding the overhead of materializing one-hot vectors. Previous FLMs match AR in Generative Perplexity (Gen. PPL), but samples with high likelihood are not necessarily correct in verifiable domains such as math and code. $\mathbb{S}$-FLM substantially improves continuous flow language models on large-vocabulary reasoning and closes the gap to masked diffusion under standard-temperature sampling ($T=1$), while a gap remains under optimized low-temperature ($T=0.1$) decoding.
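To make the two geometric points in the abstract concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It checks that distinct one-hot vectors are all at $\ell_2$ distance $\sqrt{2}$, and it integrates a rotation ODE on the unit sphere with exponential-map Euler steps. The function names are hypothetical, and `toy_velocity` is an assumed closed-form stand-in (the geodesic direction toward a known target) for the learned velocity field, which the abstract does not specify.

```python
import numpy as np

def one_hot_distances(vocab_size):
    """Pairwise l2 distances between all one-hot vectors of a vocabulary."""
    eye = np.eye(vocab_size)
    diff = eye[:, None, :] - eye[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def project_to_tangent(x, v):
    """Remove the component of v along x, leaving a vector tangent to the sphere at x."""
    return v - np.sum(v * x, axis=-1, keepdims=True) * x

def rotate_step(x, v, dt):
    """Exponential-map step: rotate unit vector x along tangent v by angle |v| * dt."""
    v = project_to_tangent(x, v)
    speed = np.linalg.norm(v, axis=-1, keepdims=True)
    direction = v / np.maximum(speed, 1e-12)
    angle = speed * dt
    return np.cos(angle) * x + np.sin(angle) * direction

def toy_velocity(x, target):
    """Assumed stand-in for a learned field: tangent at x pointing to target,
    with length equal to the remaining geodesic angle."""
    cos = np.clip(np.sum(x * target, axis=-1, keepdims=True), -1.0, 1.0)
    theta = np.arccos(cos)
    tangent = project_to_tangent(x, target)
    norm = np.linalg.norm(tangent, axis=-1, keepdims=True)
    return theta * tangent / np.maximum(norm, 1e-12)

def integrate(x0, target, n_steps=16):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 while staying on the sphere."""
    x = x0
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k / n_steps
        # Dividing by the remaining time gives constant angular speed along the geodesic.
        v = toy_velocity(x, target) / (1.0 - t)
        x = rotate_step(x, v, dt)
    return x

if __name__ == "__main__":
    # Every off-diagonal entry equals sqrt(2): no token is "closer" to any other.
    d = one_hot_distances(5)
    print(np.allclose(d[~np.eye(5, dtype=bool)], np.sqrt(2)))

    rng = np.random.default_rng(0)
    x0 = rng.normal(size=(3, 8))
    x0 /= np.linalg.norm(x0, axis=-1, keepdims=True)
    target = rng.normal(size=(3, 8))
    target /= np.linalg.norm(target, axis=-1, keepdims=True)
    out = integrate(x0, target)
    print(np.linalg.norm(out, axis=-1))  # each row stays at unit norm
```

Because each step is a rotation within the plane spanned by the current point and its tangent direction, the state never leaves the sphere, so no renormalization is needed after a step. In an actual model the target is unknown at sampling time and the velocity would come from a trained network.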