🤖 AI Summary
This work addresses the high computational cost diffusion models incur when operating on dense patch-level features from self-supervised learning (SSL) encoders such as DINOv2, whose high-dimensional representations contain substantial redundancy. The authors propose FlatDINO, a variational autoencoder that, for the first time, compresses these high-dimensional SSL features into a compact one-dimensional sequence of only 32 continuous tokens. FlatDINO integrates seamlessly with the DiT-XL diffusion architecture. On ImageNet at 256×256 resolution, a DiT-XL trained on FlatDINO latents reaches 1.80 gFID with classifier-free guidance while reducing forward-pass FLOPs by 8× and per-step training FLOPs by up to 4.5×, substantially improving computational efficiency without compromising generation quality.
📝 Abstract
Recent work has shown that diffusion models can generate high-quality images by operating directly on SSL patch features rather than pixel-space latents. However, the dense patch grids produced by encoders like DINOv2 contain significant redundancy, making diffusion needlessly expensive. We introduce FlatDINO, a variational autoencoder that compresses this representation into a one-dimensional sequence of just 32 continuous tokens, an 8× reduction in sequence length and a 48× compression in total dimensionality. On ImageNet 256×256, a DiT-XL trained on FlatDINO latents achieves a gFID of 1.80 with classifier-free guidance while requiring 8× fewer FLOPs per forward pass and up to 4.5× fewer FLOPs per training step than diffusion on uncompressed DINOv2 features. This work is in progress and the results are preliminary.
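To make the compression concrete, here is a minimal sketch of one plausible way such a tokenizer could work: learned queries cross-attend over the patch grid to pool it into 32 latent tokens, with a standard VAE head on top. The abstract's numbers are self-consistent with 768-dimensional DINOv2 patch features on a 16×16 grid (256 tokens → 32 is the 8× sequence reduction, and 256·768 / (32·128) = 48× total compression implies a latent width of 128), but the 768 and 128 dimensions, the single cross-attention layer, and all other sizes here are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class FlatDINOEncoderSketch(nn.Module):
    """Hypothetical sketch of a 1D SSL-feature tokenizer: compress a grid of
    DINOv2 patch features (assumed 256 patches x 768 dims for a 256x256 image)
    into 32 continuous latent tokens via learned queries and cross-attention.
    All layer choices and sizes are illustrative assumptions."""

    def __init__(self, patch_dim=768, num_latents=32, latent_dim=128,
                 width=512, heads=8):
        super().__init__()
        # One learned query per output latent token.
        self.queries = nn.Parameter(torch.randn(num_latents, width) * 0.02)
        self.in_proj = nn.Linear(patch_dim, width)
        self.cross_attn = nn.MultiheadAttention(width, heads, batch_first=True)
        # VAE head: predict a mean and log-variance per latent token.
        self.to_mu = nn.Linear(width, latent_dim)
        self.to_logvar = nn.Linear(width, latent_dim)

    def forward(self, patch_feats):
        # patch_feats: (B, 256, 768) grid of SSL patch features.
        b = patch_feats.shape[0]
        kv = self.in_proj(patch_feats)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.cross_attn(q, kv, kv)   # (B, 32, width)
        mu = self.to_mu(pooled)
        logvar = self.to_logvar(pooled)
        # Reparameterization trick: z = mu + sigma * eps.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar

feats = torch.randn(2, 256, 768)             # a batch of patch-feature grids
z, mu, logvar = FlatDINOEncoderSketch()(feats)
print(z.shape)                               # torch.Size([2, 32, 128])
```

A diffusion model such as DiT-XL would then be trained on the `z` sequence (32 tokens instead of 256 patches), which is where the reported FLOP savings come from; a matching decoder reconstructing the patch features would close the autoencoding loop.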