🤖 AI Summary
The design space of current diffusion and flow-matching models is largely explored, limiting further improvement, while continuous autoregressive generation shows promise for unified multimodal modeling. This paper introduces Transition Matching (TM), a novel generative paradigm operating in discrete time over a continuous state space, which unifies diffusion, flow matching, and causal autoregression via a Markov transition decomposition. Within this framework, the paper proposes three variants: Difference TM (DTM), Autoregressive TM (ARTM), and Full History TM (FHTM). FHTM is the first fully causal generator in continuous state space to match or surpass non-causal flow models; DTM achieves state-of-the-art image quality and text alignment; ARTM and FHTM match non-causal methods under identical training conditions. TM supports arbitrary stochastic transition kernels and non-continuous supervision processes, significantly broadening modeling flexibility and trajectory design freedom.
📝 Abstract
Diffusion and flow matching models have significantly advanced media generation, yet their design space is well-explored, somewhat limiting further improvements. Concurrently, autoregressive (AR) models, particularly those generating continuous tokens, have emerged as a promising direction for unifying text and media generation. This paper introduces Transition Matching (TM), a novel discrete-time, continuous-state generative paradigm that unifies and advances both diffusion/flow models and continuous AR generation. TM decomposes complex generation tasks into simpler Markov transitions, allowing expressive non-deterministic probability transition kernels and arbitrary non-continuous supervision processes, thereby unlocking new and flexible design avenues. We explore these choices through three TM variants: (i) Difference Transition Matching (DTM), which generalizes flow matching to discrete time by directly learning transition probabilities, yielding state-of-the-art image quality and text adherence as well as improved sampling efficiency. (ii) Autoregressive Transition Matching (ARTM) and (iii) Full History Transition Matching (FHTM) are partially and fully causal models, respectively, that generalize continuous AR methods. They achieve continuous causal AR generation quality comparable to non-causal approaches and potentially enable seamless integration with existing AR text generation techniques. Notably, FHTM is the first fully causal model to match or surpass the performance of flow-based methods on the text-to-image task in continuous domains. We demonstrate these contributions through a rigorous large-scale comparison of TM variants and relevant baselines, maintaining a fixed architecture, training data, and hyperparameters.
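To make the core idea concrete, here is a minimal toy sketch of discrete-time sampling via a Markov transition decomposition: generation runs through T transitions X_0 → X_1 → … → X_T, each drawn from a stochastic transition kernel. The kernel below is a hypothetical scalar stand-in (a noisy contraction toward a fixed target), not the paper's learned model, and the schedule is an illustrative assumption:

```python
import random

def toy_kernel(x, t, T, target=1.0):
    """Stand-in for a learned stochastic transition kernel p_theta(. | X_t, t)."""
    # Move a fraction of the remaining distance toward the target, plus
    # Gaussian noise whose scale shrinks as t -> T, so transitions are
    # non-deterministic early and near-deterministic at the end.
    frac = 1.0 / (T - t)                 # remaining-steps schedule
    mean = x + frac * (target - x)
    noise_scale = 0.1 * (1.0 - t / T)
    return mean + random.gauss(0.0, noise_scale)

def sample(T=16, seed=0):
    """Run the T-step Markov chain starting from pure noise X_0 ~ N(0, 1)."""
    random.seed(seed)
    x = random.gauss(0.0, 1.0)           # X_0: initial noise sample
    for t in range(T):
        x = toy_kernel(x, t, T)          # X_{t+1} ~ p_theta(. | X_t)
    return x

print(sample())                          # ends close to the target value 1.0
```

Diffusion and flow matching correspond to many small, near-Gaussian transitions of this chain, while the AR-style variants (ARTM, FHTM) condition each transition on more of the generation history.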