🤖 AI Summary
This work systematically investigates the design, training, and sampling strategies of the time-continuous bidirectional "head" module within the Transition Matching (TM) framework to enhance both quality and efficiency in text-to-image generation. Based on 549 evaluations across 56 trained models (1.7B parameters each), we quantitatively analyze the impact of architectural choices (MLP vs. Transformer heads), temporal weighting mechanisms, and stochastic sampling frequency on FID, CLIP Score, and training/inference overhead. We distill key design principles for TM heads: (i) MLP heads with high-frequency sampling achieve the best overall performance; (ii) Transformer heads, when combined with sequence scaling and low-frequency sampling, significantly improve aesthetic quality. Our findings provide a reproducible, scalable design guideline and an efficient pathway for practical deployment of TM.
📝 Abstract
Transition Matching (TM) is an emerging paradigm for generative modeling that generalizes diffusion and flow-matching models as well as continuous-state autoregressive models. Like previous paradigms, TM gradually transforms noise samples into data samples; however, it uses a second "internal" generative model to implement the transition steps, making the transitions more expressive than those of diffusion and flow models. To make this paradigm tractable, TM employs a large backbone network and a smaller "head" module that efficiently executes the generative transition step. In this work, we present a large-scale, systematic investigation into the design, training, and sampling of the head in TM frameworks, focusing on its time-continuous bidirectional variant. Through comprehensive ablations involving the training of 56 different 1.7B text-to-image models (resulting in 549 unique evaluations), we evaluate the effect of the head module's architecture and modeling during training, as well as a useful family of stochastic TM samplers. We analyze the impact on generation quality and on training and inference efficiency. We find that TM with an MLP head, trained with a particular time weighting and sampled with a high-frequency sampler, ranks best across all metrics, reaching state-of-the-art performance among all tested baselines, while a Transformer head with sequence scaling and low-frequency sampling is a runner-up that excels at image aesthetics. Lastly, we believe the presented experiments highlight the design aspects that are likely to provide the largest quality and efficiency gains, while also indicating which design choices are unlikely to provide further gains.
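The backbone/head split described above can be illustrated with a minimal sketch. All names below (`backbone`, `head_transition`, `tm_sample`) are hypothetical stand-ins, not the paper's actual implementation: the point is only the control flow, where an expensive backbone runs once per outer step and a small stochastic head samples each generative transition.

```python
import math
import random

def backbone(x, t):
    # Stand-in for the large (e.g. 1.7B) backbone: any deterministic
    # feature map over the current state and time.
    return [math.tanh(xi + t) for xi in x]

def head_transition(feat, x, dt, rng):
    # Stand-in for the head's generative step: drift toward the backbone
    # feature plus time-scaled noise. In the paper's ablations a real head
    # would be an MLP or a small Transformer.
    return [
        xi + dt * (fi - xi) + math.sqrt(dt) * 0.1 * rng.gauss(0.0, 1.0)
        for xi, fi in zip(x, feat)
    ]

def tm_sample(dim, n_steps, seed=0):
    # Outer loop mirrors diffusion/flow sampling: start from noise and
    # apply one head transition per step as t runs from 0 toward 1.
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        feat = backbone(x, t)                  # expensive: once per step
        x = head_transition(feat, x, dt, rng)  # cheap generative transition
    return x

sample = tm_sample(dim=8, n_steps=16)
print(len(sample))  # 8
```

The sampling-frequency ablations in the paper correspond, in this toy picture, to how often fresh noise enters through the head's transitions relative to the number of outer backbone steps.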