🤖 AI Summary
Current text-to-image (T2I) models face a bottleneck in fusing text with visual tokens, struggling to achieve precise conditional control and backbone generality at the same time. To address this, we propose JEPA-T: a unified multimodal framework built upon the Joint Embedding Predictive Architecture (JEPA), which enhances text guidance via cross-attention and enforces hierarchical alignment between raw text embeddings and discrete visual features, so that one model handles both class-conditional and open-vocabulary text-to-image generation. Technically, JEPA-T integrates discrete multimodal tokenization, a JEPA-Transformer backbone, conditional denoising diffusion, and a flow-matching loss. On ImageNet-1K, JEPA-T achieves significant improvements in data efficiency and open-vocabulary generalization, consistently outperforming both non-fusion and late-fusion baselines across all metrics.
📝 Abstract
Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures that are trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose JEPA-T, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while maintaining a task-agnostic backbone. Additionally, raw text embeddings are injected prior to the flow matching loss to improve alignment during training. During inference, the same network performs both class-conditional and free-text image generation by iteratively denoising visual tokens conditioned on text. Evaluations on ImageNet-1K demonstrate that JEPA-T achieves strong data efficiency and open-vocabulary generalization, and consistently outperforms non-fusion and late-fusion baselines. Our approach shows that late architectural fusion combined with objective-level alignment offers an effective balance between conditioning strength and backbone generality in token-based T2I. The code is available at https://github.com/justin-herry/JEPA-T.git
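The three mechanisms named in the abstract (cross-attention from visual features to text tokens, a flow matching objective, and iterative denoising at inference) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names are hypothetical, the attention has a single head and no learned projections, and the linear interpolation path with a fixed-step Euler sampler is an assumption, since the abstract does not specify the probability path or the solver.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    Assumption: queries stand in for predicted visual features and
    keys/values for text token embeddings; no learned projections.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
    return outputs


def flow_matching_pair(noise, data, t):
    """Assumed linear path x_t = (1 - t) * noise + t * data.

    Returns the interpolated point and the target velocity data - noise.
    """
    x_t = [(1 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, target_velocity)) / len(
        target_velocity
    )


def euler_sample(velocity_fn, noise, steps):
    """Iterative denoising: integrate dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a real model, `velocity_fn` would be a network that receives the text-conditioned features produced by the cross-attention step; here it is left as a plain callable so the sketch stays self-contained.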