🤖 AI Summary
This work addresses a limitation of existing minority sampling methods: they define rarity relative to the prior learned by a generative model, which may poorly reflect semantic rarity in the real world. The authors propose a world-centric perspective on minority sampling and introduce JEPA guidance, which uses a Joint-Embedding Predictive Architecture (JEPA) as a world model to guide diffusion sampling. The guidance steers sampling trajectories toward low-density regions under the implicit density induced by the JEPA, so that generated samples align with real-world semantic rarity. To keep the guidance computationally practical, the authors develop approximation strategies with theoretical error bounds. Experiments on unconditional, class-conditional, and text-to-image generation show improved fidelity and semantic validity of minority samples compared with generator-centric baselines.
📝 Abstract
Minority sampling aims to generate low-density instances on a data manifold and is of central importance in applications such as medical diagnosis, anomaly detection, and creative AI. Existing approaches, however, define minority samples relative to generative priors learned from training data, confining rarity to model-specific notions that may poorly reflect real-world semantics. In this work, we propose a world-centric perspective on minority sampling, which defines rarity with respect to real-world priors rather than generator-induced densities. To this end, we introduce JEPA guidance, a diffusion sampling framework guided by a Joint-Embedding Predictive Architecture (JEPA), a class of world models that encode broad, semantically rich representations. JEPA guidance steers diffusion trajectories toward low-density regions under the implicit density induced by the JEPA, thereby aligning generated minorities with real-world semantic rarity. To make JEPA guidance computationally practical, we develop principled approximation strategies accompanied by theoretical error bounds, significantly reducing the overhead of guidance computation. Extensive experiments across unconditional, class-conditional, and text-to-image generation demonstrate that JEPA guidance consistently improves the fidelity and semantic validity of minority samples, outperforming generator-centric baselines in capturing real-world notions of rarity. Code is available at https://github.com/soobin-um/jepa-guidance.
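
The general idea of steering a sampler toward low-density regions of a separate density can be illustrated with a toy example. The sketch below is not the authors' method and does not use a JEPA or a diffusion model: it assumes a 2D standard Gaussian as a stand-in for both the generator's score and the guiding density, and runs plain Langevin dynamics. The function names `base_score`, `guide_log_density_grad`, and `sample`, as well as the guidance weight `w`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def base_score(x):
    # Score (gradient of log-density) of a standard Gaussian.
    # Assumption: stands in for a generative model's learned score.
    return -x


def guide_log_density_grad(x):
    # Gradient of the log-density of a standard Gaussian.
    # Assumption: stands in for the implicit density of a guiding model.
    return -x


def sample(n, w, steps=500, eps=0.01, seed=0):
    # Langevin dynamics with an extra term that descends the guide's
    # log-density, pushing samples toward its low-density regions.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, 2))
    for _ in range(steps):
        drift = base_score(x) - w * guide_log_density_grad(x)
        noise = rng.standard_normal(x.shape)
        x = x + eps * drift + np.sqrt(2 * eps) * noise
    return x


plain = sample(2000, w=0.0)   # no guidance
guided = sample(2000, w=0.5)  # guidance toward low guide density

# Average squared distance from the mode; larger means lower density here.
plain_spread = float(np.mean(np.sum(plain**2, axis=1)))
guided_spread = float(np.mean(np.sum(guided**2, axis=1)))
print(plain_spread, guided_spread)
```

In this toy setting the guided samples spread further from the mode than the unguided ones, which is the qualitative effect the abstract describes. How the implicit density is obtained from a JEPA and how the guidance is approximated efficiently are specific to the paper and are not reproduced here.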