🤖 AI Summary
This work addresses trajectory conflicts and suboptimal generation arising from semantic discontinuities in pixel space, which entangle optimal transport paths. To resolve this, the authors propose the Waypoint Diffusion Transformer (WiT), which decouples the generation process into a two-segment path by introducing intermediate semantic waypoints guided by a pretrained vision model. This approach explicitly models trajectories in pixel space, circumventing information loss inherent in latent-space formulations. WiT incorporates a novel Just-Pixel AdaLN mechanism that enables continuous, waypoint-driven conditioning of the Transformer backbone. Built upon the Flow Matching and diffusion frameworks, WiT outperforms strong baselines on ImageNet at 256×256 resolution and accelerates JiT training convergence by a factor of 2.2.
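The two-segment path described above (prior → waypoint → pixels) can be illustrated with a minimal flow-matching sketch. This is a generic piecewise-linear interpolant, not the paper's exact parameterization: the waypoint `w`, the switch time `t_w`, and the linear segments are all illustrative assumptions.

```python
import numpy as np

def two_segment_point(x0, w, x1, t, t_w=0.5):
    """Piecewise-linear interpolant: prior x0 -> waypoint w on [0, t_w],
    then waypoint w -> data x1 on [t_w, 1]. A sketch only; the paper's
    actual path construction and switch time t_w are not specified here."""
    if t < t_w:
        s = t / t_w
        return (1 - s) * x0 + s * w
    s = (t - t_w) / (1 - t_w)
    return (1 - s) * w + s * x1

def two_segment_velocity(x0, w, x1, t, t_w=0.5):
    """Flow-matching regression target: d/dt of the interpolant above.
    Constant within each segment, so trajectories headed through the same
    waypoint share a direction instead of crossing."""
    if t < t_w:
        return (w - x0) / t_w
    return (x1 - w) / (1 - t_w)
```

Splitting the transport this way is what lets trajectories that would intersect in pixel space be routed through a shared semantic waypoint first.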
📝 Abstract
While recent Flow Matching models avoid the reconstruction bottlenecks of latent autoencoders by operating directly in pixel space, the lack of semantic continuity in the pixel manifold intertwines optimal transport paths. This induces severe trajectory conflicts near intersections, yielding sub-optimal solutions. Rather than bypassing this issue via information-lossy latent representations, we directly untangle the pixel-space trajectories by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field via intermediate semantic waypoints projected from pre-trained vision models, effectively disentangling the generation trajectories by breaking the optimal transport into prior-to-waypoint and waypoint-to-pixel segments. Specifically, during the iterative denoising process, a lightweight generator dynamically infers these intermediate waypoints from the current noisy state. The waypoints then continuously condition the primary diffusion transformer via the Just-Pixel AdaLN mechanism, steering the evolution toward the next state and ultimately yielding the final RGB pixels. Evaluated on ImageNet 256×256, WiT outperforms strong pixel-space baselines and accelerates JiT training convergence by 2.2×. Code will be publicly released at https://github.com/hainuo-wang/WiT.git.
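The waypoint-driven conditioning can be pictured as an adaptive layer norm in the standard DiT style: a conditioning vector (here, a waypoint embedding) is projected to per-channel scale and shift that modulate the transformer's normalized activations. The abstract does not detail how "Just-Pixel AdaLN" refines this, so the following is a sketch of plain AdaLN conditioning; the class name, shapes, and initialization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-6):
    """Normalize features along the last axis (no learned affine)."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class AdaLN:
    """DiT-style adaptive layer norm: the conditioning vector is linearly
    projected to a per-channel scale (gamma) and shift (beta). This is a
    generic sketch, not the paper's Just-Pixel AdaLN variant."""
    def __init__(self, cond_dim, dim):
        # Small random projection; zero bias so zero conditioning
        # reduces to an identity-scaled layer norm.
        self.W = rng.normal(0.0, 0.02, (cond_dim, 2 * dim))
        self.b = np.zeros(2 * dim)

    def __call__(self, x, cond):
        # x: (batch, tokens, dim) activations; cond: (batch, cond_dim) waypoint embedding
        gamma, beta = np.split(cond @ self.W + self.b, 2, axis=-1)
        return (1 + gamma)[:, None, :] * layer_norm(x) + beta[:, None, :]
```

Because the waypoint generator re-infers the conditioning vector from the current noisy state at each denoising step, the same modulation pathway receives a continuously updated signal rather than a fixed class embedding.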