Denoising with a Joint-Embedding Predictive Architecture

📅 2024-10-02
🏛️ arXiv.org
📈 Citations: 1
Influential: 0

🤖 AI Summary
This work addresses the underexplored use of joint-embedding predictive architectures (JEPAs) in generative modeling by introducing D-JEPA, which brings JEPA to generative tasks. JEPA is treated as a form of masked image modeling and reinterpreted as a generalized next-token prediction strategy, so data can be generated auto-regressively. A diffusion loss models the per-token probability distribution in continuous space, and a flow matching loss is offered as an alternative. Key contributions include: (1) a JEPA-based generative framework; (2) compatibility with both diffusion and flow matching losses; and (3) reported state-of-the-art results on ImageNet conditional generation, where FID falls as GFLOPs increase and fewer training epochs are needed, across the base, large, and huge model scales. The authors also describe the approach as well suited to other continuous data such as video and audio.

📝 Abstract
Joint-embedding predictive architectures (JEPAs) have shown substantial promise in self-supervised representation learning, yet their application in generative modeling remains underexplored. Conversely, diffusion models have demonstrated significant efficacy in modeling arbitrary probability distributions. In this paper, we introduce Denoising with a Joint-Embedding Predictive Architecture (D-JEPA), pioneering the integration of JEPA within generative modeling. By recognizing JEPA as a form of masked image modeling, we reinterpret it as a generalized next-token prediction strategy, facilitating data generation in an auto-regressive manner. Furthermore, we incorporate diffusion loss to model the per-token probability distribution, enabling data generation in a continuous space. We also adapt flow matching loss as an alternative to diffusion loss, thereby enhancing the flexibility of D-JEPA. Empirically, with increased GFLOPs, D-JEPA consistently achieves lower FID scores with fewer training epochs, indicating its good scalability. Our base, large, and huge models outperform all previous generative models across all scales on ImageNet conditional generation benchmarks. Beyond image generation, D-JEPA is well-suited for other continuous data modeling, including video and audio.
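The abstract's idea of using a diffusion loss to model the per-token probability distribution can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the cosine noise schedule, the function names, and the `predict_noise` callable (a stand-in for a small denoising network conditioned on per-token context features) are all assumptions made for illustration.

```python
import numpy as np

def cosine_alpha_bar(t):
    """Cumulative signal level for t in [0, 1] (cosine schedule, assumed here)."""
    return np.cos(0.5 * np.pi * t) ** 2

def per_token_diffusion_loss(tokens, context, predict_noise, rng):
    """Noise-prediction MSE computed independently for every token.

    tokens:        (n_tokens, dim) clean continuous tokens
    context:       (n_tokens, ctx_dim) conditioning features, one row per token
    predict_noise: hypothetical callable (x_t, t, context) -> predicted noise
    """
    n, _ = tokens.shape
    t = rng.uniform(0.0, 1.0, size=(n, 1))      # one noise level per token
    eps = rng.standard_normal(tokens.shape)     # Gaussian noise to be predicted
    a = cosine_alpha_bar(t)
    x_t = np.sqrt(a) * tokens + np.sqrt(1.0 - a) * eps  # noised token
    pred = predict_noise(x_t, t, context)
    return np.mean((pred - eps) ** 2, axis=1)   # one loss value per token

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))
context = rng.standard_normal((8, 16))
# Placeholder predictor that always outputs zeros, only to exercise the function.
loss = per_token_diffusion_loss(tokens, context, lambda x, t, c: np.zeros_like(x), rng)
print(loss.shape)
```

In a setup like the one the abstract describes, the context rows would come from the predictive backbone, and the loss would be averaged over the tokens selected for prediction.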
Problem

Research questions and friction points this paper is trying to address.

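Applying JEPA, so far used mainly for self-supervised representation learning, to generative modeling
Generating data in a continuous space with a choice of per-token loss (diffusion or flow matching)
Improving FID on ImageNet conditional generation across model scales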
Innovation

Methods, ideas, or system contributions that make the work stand out.

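Reinterprets JEPA as a generalized next-token prediction strategy for auto-regressive generation
Uses a diffusion loss to model the per-token probability distribution
Adapts flow matching loss as an alternative to diffusion loss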
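
As a rough illustration of how a flow matching loss can stand in for a diffusion loss on the same per-token targets, the sketch below uses the common straight-line (linear interpolation) path between noise and data. The path choice, the function name, and the `predict_velocity` callable are assumptions for illustration; the summary above does not specify the exact formulation the paper uses.

```python
import numpy as np

def per_token_flow_matching_loss(tokens, context, predict_velocity, rng):
    """Velocity-regression MSE along a straight noise-to-data path, per token.

    tokens:           (n_tokens, dim) clean continuous tokens
    context:          (n_tokens, ctx_dim) conditioning features, one row per token
    predict_velocity: hypothetical callable (x_t, t, context) -> predicted velocity
    """
    n, _ = tokens.shape
    t = rng.uniform(0.0, 1.0, size=(n, 1))      # one time value per token
    noise = rng.standard_normal(tokens.shape)
    x_t = (1.0 - t) * noise + t * tokens        # point on the straight-line path
    target = tokens - noise                     # constant velocity of that path
    pred = predict_velocity(x_t, t, context)
    return np.mean((pred - target) ** 2, axis=1)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))
context = rng.standard_normal((8, 16))
# Placeholder predictor that always outputs zeros, only to exercise the function.
loss = per_token_flow_matching_loss(tokens, context, lambda x, t, c: np.zeros_like(x), rng)
print(loss.shape)
```

Under these assumptions, only the regression target changes relative to a noise-prediction diffusion loss, so the conditioning and per-token structure stay the same.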