RePack: Representation Packing of Vision Foundation Model Features Enhances Diffusion Transformer

📅 2025-12-12

📈 Citations: 0

✨ Influential: 0

career value

186K/year

🤖 AI Summary

To address information overload caused by injecting high-dimensional visual foundation model (VFM) features—e.g., DINOv3—into diffusion Transformers, this paper proposes Representation Packing: a manifold-constrained semantic-aware feature compression mechanism that projects VFM features onto a low-dimensional manifold, preserving core semantic structure while suppressing non-semantic noise. The method balances representational compactness and generative fidelity, and integrates seamlessly into the Diffusion Transformer (DiT) architecture. Evaluated on DiT-XL/2, it achieves an FID of 3.66 within only 64 training epochs—accelerating convergence by 35% over state-of-the-art methods—and significantly improves image reconstruction quality and decoding efficiency. Its core innovation lies in being the first to introduce manifold-regularized, semantics-preserving feature compression into diffusion-based generative modeling, establishing a new paradigm for efficient, VFM-driven synthesis.

Technology Category

Application Category

📝 Abstract

The superior representation capability of pre-trained vision foundation models (VFMs) has been harnessed for enhancing latent diffusion models (LDMs). These approaches inject the rich semantics from high-dimensional VFM representations (e.g., DINOv3) into LDMs at different phases, resulting in accelerated learning and better generation performance. However, the high-dimensionality of VFM representations may also lead to Information Overload, particularly when the VFM features exceed the size of the original image for decoding. To address this issue while preserving the utility of VFM features, we propose RePack (Representation Packing), a simple yet effective framework for improving Diffusion Transformers (DiTs). RePack transforms the VFM representation into a more compact, decoder-friendly representation by projecting onto low-dimensional manifolds. We find that RePack can effectively filter out non-semantic noise while preserving the core structural information needed for high-fidelity reconstruction. Experimental results show that RePack significantly accelerates DiT convergence and outperforms recent methods that directly inject raw VFM features into the decoder for image reconstruction. On DiT-XL/2, RePack achieves an FID of 3.66 in only 64 epochs, which is 35% faster than the state-of-the-art method. This demonstrates that RePack successfully extracts the core semantics of VFM representations while bypassing their high-dimensionality side effects.

Problem

Research questions and friction points this paper is trying to address.

Addresses information overload from high-dimensional VFM features in diffusion models

Proposes a compact representation method to preserve core semantics for reconstruction

Enhances DiT convergence speed and generation performance over raw feature injection

Innovation

Methods, ideas, or system contributions that make the work stand out.

RePack compresses high-dimensional VFM features into compact representations

It projects features onto low-dimensional manifolds to reduce information overload

This accelerates DiT convergence and improves image generation performance

🔎 Similar Papers

No similar papers found.