Denoising with a Joint-Embedding Predictive Architecture

📅 2024-10-02

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This work addresses the underexplored use of Joint-Embedding Predictive Architecture (JEPA) in generative modeling and proposes D-JEPA, which the authors present as the first integration of JEPA into generative tasks. Methodologically, JEPA is treated as a form of masked image modeling and reinterpreted as generalized next-token prediction, so data can be generated autoregressively. A diffusion loss models the per-token probability distribution in continuous space, and a flow matching loss is offered as an alternative. Key contributions include: (1) a JEPA-based generative paradigm; (2) per-token modeling with either diffusion or flow matching loss; and (3) results on ImageNet conditional generation in which the base, large, and huge models outperform previous generative models, with FID decreasing as GFLOPs increase and with fewer training epochs. The authors also describe D-JEPA as well-suited to other continuous data such as video and audio.

📝 Abstract

Joint-embedding predictive architectures (JEPAs) have shown substantial promise in self-supervised representation learning, yet their application in generative modeling remains underexplored. Conversely, diffusion models have demonstrated significant efficacy in modeling arbitrary probability distributions. In this paper, we introduce Denoising with a Joint-Embedding Predictive Architecture (D-JEPA), pioneering the integration of JEPA within generative modeling. By recognizing JEPA as a form of masked image modeling, we reinterpret it as a generalized next-token prediction strategy, facilitating data generation in an auto-regressive manner. Furthermore, we incorporate diffusion loss to model the per-token probability distribution, enabling data generation in a continuous space. We also adapt flow matching loss as an alternative to diffusion loss, thereby enhancing the flexibility of D-JEPA. Empirically, with increased GFLOPs, D-JEPA consistently achieves lower FID scores with fewer training epochs, indicating its good scalability. Our base, large, and huge models outperform all previous generative models across all scales on ImageNet conditional generation benchmarks. Beyond image generation, D-JEPA is well-suited for other continuous data modeling, including video and audio.
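The "generalized next-token prediction" idea in the abstract can be illustrated with a small sketch: a fully masked sequence is filled in over several steps, in a random order rather than left to right. This is only an illustration under stated assumptions, not the authors' implementation. The names `generate_tokens`, `predict_token`, `tokens_per_step`, and `toy_predictor` are hypothetical, and the toy predictor stands in for a real model that would sample each token from a learned per-token distribution.

```python
import random


def generate_tokens(num_tokens, predict_token, tokens_per_step=1, seed=0):
    """Fill a fully masked sequence in random order, a few positions per step.

    predict_token(position, known) is a hypothetical stand-in for the model:
    it receives the target position and a dict of already generated tokens.
    """
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)  # random generation order instead of left-to-right
    known = {}  # position -> generated token
    for start in range(0, num_tokens, tokens_per_step):
        batch = order[start:start + tokens_per_step]
        # Every position in this step sees the same snapshot of the context.
        context = dict(known)
        for pos in batch:
            known[pos] = predict_token(pos, context)
    return [known[i] for i in range(num_tokens)]


def toy_predictor(position, known):
    # Toy rule (assumption for illustration): emit the number of visible tokens.
    return float(len(known))
```

Calling `generate_tokens(6, toy_predictor, tokens_per_step=2)` fills six positions in three steps. Setting `tokens_per_step=1` gives strictly one-token-at-a-time generation, while larger values trade sequential steps for parallel prediction.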

Problem

Research questions and friction points this paper is trying to address.

Integrating JEPA in generative modeling

Enhancing data generation flexibility

Improving results on ImageNet conditional generation benchmarks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Joint-Embedding Predictive Architecture into generative modeling

Uses diffusion loss to model the per-token probability distribution

Adds flow matching loss as an alternative to diffusion loss
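To make the flow matching item concrete, here is a minimal sketch of a one-sample flow matching loss for a single continuous token. The page does not state the probability path or parameterization used in D-JEPA, so the straight-line path between a Gaussian noise sample and the target token is an assumption, and `flow_matching_loss` and `velocity_fn` are hypothetical names. In a per-token setup, `velocity_fn` would also be conditioned on the model's prediction for that token; here any conditioning is left to the caller's closure.

```python
import random


def flow_matching_loss(x1, velocity_fn, rng):
    """One-sample flow matching loss for a single token.

    x1: target token as a list of floats.
    velocity_fn(xt, t): hypothetical model returning a predicted velocity.
    Assumes the straight-line path x_t = (1 - t) * x0 + t * x1.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    t = rng.random()                        # time in [0, 1)
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]  # velocity along the path
    pred = velocity_fn(xt, t)
    # Mean squared error between predicted and target velocity.
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)
```

On this path the target velocity equals `(x1 - xt) / (1 - t)`, so a predictor that knows `x1` can drive the loss to roughly zero, which is a convenient sanity check for the sketch.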
