🤖 AI Summary
To address information overload caused by injecting high-dimensional visual foundation model (VFM) features—e.g., DINOv3—into diffusion Transformers, this paper proposes Representation Packing: a manifold-constrained semantic-aware feature compression mechanism that projects VFM features onto a low-dimensional manifold, preserving core semantic structure while suppressing non-semantic noise. The method balances representational compactness and generative fidelity, and integrates seamlessly into the Diffusion Transformer (DiT) architecture. Evaluated on DiT-XL/2, it achieves an FID of 3.66 within only 64 training epochs—accelerating convergence by 35% over state-of-the-art methods—and significantly improves image reconstruction quality and decoding efficiency. Its core innovation lies in being the first to introduce manifold-regularized, semantics-preserving feature compression into diffusion-based generative modeling, establishing a new paradigm for efficient, VFM-driven synthesis.
📝 Abstract
The superior representation capability of pre-trained vision foundation models (VFMs) has been harnessed for enhancing latent diffusion models (LDMs). These approaches inject the rich semantics from high-dimensional VFM representations (e.g., DINOv3) into LDMs at different phases, resulting in accelerated learning and better generation performance. However, the high-dimensionality of VFM representations may also lead to Information Overload, particularly when the VFM features exceed the size of the original image for decoding. To address this issue while preserving the utility of VFM features, we propose RePack (Representation Packing), a simple yet effective framework for improving Diffusion Transformers (DiTs). RePack transforms the VFM representation into a more compact, decoder-friendly representation by projecting onto low-dimensional manifolds. We find that RePack can effectively filter out non-semantic noise while preserving the core structural information needed for high-fidelity reconstruction. Experimental results show that RePack significantly accelerates DiT convergence and outperforms recent methods that directly inject raw VFM features into the decoder for image reconstruction. On DiT-XL/2, RePack achieves an FID of 3.66 in only 64 epochs, which is 35% faster than the state-of-the-art method. This demonstrates that RePack successfully extracts the core semantics of VFM representations while bypassing their high-dimensionality side effects.