Scaling Categorical Flow Maps

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates whether continuous flow matching for discrete text generation scales to large language models. The authors train a 1.7-billion-parameter base flow model on 2.1 trillion tokens and self-distill it into a Categorical Flow Map (CFM), extending a method that had previously been evaluated only below 1B parameters. The approach builds on flow matching between a Gaussian and the one-hot encoded data distribution, and the resulting CFM generates diverse, high-quality text in as few as four inference steps while keeping token entropy near that of the data. The paper also introduces a likelihood bound for CFMs in the semi-discrete setting, which allows the model to be scored on standard language modeling benchmarks, where results fall in the same range as discrete diffusion methods. Finally, it reports challenges that arise when training these models at scale and offers prescriptive guidance on loss weighting and time scheduling.
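To make the "token entropy" comparison concrete, here is a minimal sketch of one common definition: the Shannon entropy of the empirical unigram distribution over a sample of token IDs. Whether the paper uses exactly this definition is an assumption, and the function name `token_entropy_bits` is hypothetical.

```python
import math
from collections import Counter


def token_entropy_bits(token_ids):
    """Shannon entropy (in bits) of the empirical unigram token distribution.

    Assumption: "token entropy" means unigram entropy over a sample of
    generated (or reference) token IDs. The paper may define it differently.
    """
    if not token_ids:
        raise ValueError("token_ids must be non-empty")
    counts = Counter(token_ids)
    total = len(token_ids)
    # Sum of -p * log2(p) over every token that occurs at least once.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under this definition, a sampler that collapses onto a few frequent tokens scores well below the entropy measured on held-out data, while a value close to the data's entropy suggests the generated token distribution is comparably diverse.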

📝 Abstract

Continuous diffusion and flow matching models could represent a powerful alternative to autoregressive approaches for language modelling (LM), as they unlock a host of advantages currently reserved for continuous modalities, including accelerated sampling and tilting. Recently, several works have demonstrated the possibility of generating discrete data continuously by a simple flow matching process between a Gaussian and the one-hot encoded data distribution. They have further shown the feasibility of accelerated sampling via Categorical Flow Maps (CFMs), resulting in competitive sample quality in the few-step regime. However, this method had only been evaluated at relatively modest scales ($<1$B), leaving the question of its scalability completely open. In this article, we train a $1.7$B-parameter base flow model on $2.1$T tokens and self-distill it into a CFM that generates diverse, high-quality text in as few as $4$ inference steps while maintaining near-data-level token entropy. Furthermore, we introduce a likelihood bound for CFMs in the semi-discrete setting, and show that they can be used to score the model on standard LM benchmarks, achieving results in the same range as discrete diffusion methods. Finally, we uncover some of the challenges that arise from training these models at scale, and we provide prescriptive insights on loss weighting and time scheduling.
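The flow matching process between a Gaussian and one-hot encoded data can be illustrated with a toy sketch. It assumes a linear interpolant x_t = (1 - t) x_0 + t x_1 and a single token position; the paper's actual path, time schedule, and loss weighting are not given here and may differ. All function names are hypothetical, and the sketch does not implement a flow map or self-distillation: it only shows the noise-to-one-hot path and a few-step Euler integration of a velocity field.

```python
import random


def one_hot(token_id, vocab_size):
    """One-hot vector for a single token."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]


def interpolate(x0, x1, t):
    """Assumed linear interpolant x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Conditional velocity of the linear path, d x_t / dt = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def sample_training_pair(token_id, vocab_size, rng):
    """Draw (t, x_t, target velocity) for one token, as a regression example."""
    x0 = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]  # Gaussian source
    x1 = one_hot(token_id, vocab_size)                     # data endpoint
    t = rng.random()                                       # uniform t in [0, 1)
    return t, interpolate(x0, x1, t), velocity_target(x0, x1)


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def decode(x):
    """Map a continuous vector back to a token by taking the argmax."""
    return max(range(len(x)), key=lambda i: x[i])
```

In a trained model, `velocity_fn` would be a neural network regressed onto `velocity_target`. For a quick check, an oracle such as `lambda x, t: [(b - a) / (1.0 - t) for a, b in zip(x, x1)]`, which knows the endpoint `x1`, can be passed to `euler_sample` with `num_steps=4`, and `decode` then recovers the token.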

Problem

Research questions and friction points this paper is trying to address.

Categorical Flow Maps

scalability

language modeling

flow matching

large-scale models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Categorical Flow Maps

flow matching

large-scale language modeling