JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-image (T2I) models face a bottleneck in text–visual token fusion, struggling to simultaneously achieve precise conditional control and backbone generality. To address this, we propose JEPA-T: a unified multimodal framework built upon the Joint Embedding Predictive Architecture (JEPA), which enhances text guidance via cross-attention and enforces hierarchical alignment between raw text embeddings and discrete visual features, enabling both class-conditional and open-vocabulary (free-text) image generation within a single network. Technically, JEPA-T integrates discrete multimodal tokenization, a JEPA-Transformer backbone, conditional denoising, and a flow-matching loss. On ImageNet-1K, JEPA-T achieves significant improvements in data efficiency and open-vocabulary generalization, consistently outperforming both non-fusion and post-fusion baselines across all metrics.
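To picture the late-fusion idea described above, the sketch below shows one way cross-attention over raw text embeddings could be attached after a feature predictor while keeping the backbone task-agnostic. This is a minimal PyTorch sketch under assumed shapes and module names (CrossAttentionFusion, dim, num_heads are illustrative), not the authors' released implementation.

```python
# Minimal sketch (assumption: PyTorch; shapes, names, and dimensions are
# illustrative, not the paper's actual code).
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuses predicted visual-token features with raw text embeddings via
    cross-attention, leaving the predictor backbone untouched."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, N_vis, dim) output of the JEPA feature predictor
        # text_embeds:  (B, N_txt, dim) raw caption embeddings
        fused, _ = self.attn(query=visual_feats, key=text_embeds, value=text_embeds)
        # Residual connection keeps the backbone's features intact while adding text guidance.
        return self.norm(visual_feats + fused)
```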

📝 Abstract
Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures that are trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose JEPA-T, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while maintaining a task-agnostic backbone. Additionally, raw text embeddings are injected prior to the flow-matching loss to improve alignment during training. During inference, the same network performs both class-conditional and free-text image generation by iteratively denoising visual tokens conditioned on text. Evaluations on ImageNet-1K demonstrate that JEPA-T achieves strong data efficiency and open-vocabulary generalization, and consistently outperforms non-fusion and late-fusion baselines. Our approach shows that late architectural fusion combined with objective-level alignment offers an effective balance between conditioning strength and backbone generality in token-based T2I. The code is now available: https://github.com/justin-herry/JEPA-T.git
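The abstract's inference procedure (iteratively denoising visual tokens conditioned on text) can be pictured as Euler integration of a learned flow. The following is a hedged sketch assuming a velocity-prediction (flow-matching) sampler; the names sample, model, and all shapes are hypothetical and do not reflect the released repository.

```python
# Hedged sketch of text-conditioned iterative denoising (assumption:
# flow-matching / velocity-prediction formulation; names are illustrative).
import torch

@torch.no_grad()
def sample(model, text_embeds, num_tokens: int = 256, dim: int = 768, steps: int = 50):
    """model(x, t, text_embeds) -> predicted velocity over visual-token features."""
    B = text_embeds.shape[0]
    x = torch.randn(B, num_tokens, dim, device=text_embeds.device)  # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((B,), i * dt, device=x.device)
        v = model(x, t, text_embeds)   # predictor + cross-attention fusion inside `model`
        x = x + dt * v                  # Euler step along the learned flow
    # Denoised visual-token features; a decoder maps them back to image tokens / pixels.
    return x
```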
Problem

Research questions and friction points this paper is trying to address.

Enhancing text-visual token fusion in image generation
Improving alignment between text and visual representations
Balancing conditioning strength with model generality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint-embedding predictive Transformer for multimodal fusion
Cross-attention mechanism for conditional denoising
Text embedding injection for improved alignment (see the training-objective sketch after this list)
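To make the objective-level alignment listed above concrete, here is an illustrative flow-matching training step in which text embeddings enter the model call before the loss is computed. The rectified-flow interpolation and the names flow_matching_loss and model are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative flow-matching objective with text conditioning (assumption:
# rectified-flow style interpolation; not the released code's API).
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, text_embeds):
    """x1: clean visual-token features (B, N, D); text_embeds: caption embeddings."""
    x0 = torch.randn_like(x1)                           # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device) # random time per sample
    xt = (1 - t) * x0 + t * x1                          # linear interpolation between noise and data
    target_v = x1 - x0                                  # ground-truth velocity along the path
    pred_v = model(xt, t.squeeze(), text_embeds)        # text injected before the loss is computed
    return F.mse_loss(pred_v, target_v)
```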
Authors

Siheng Wan (Jiangsu University)
Zhengtao Yao (University of Southern California)
Zhengdao Li (The Chinese University of Hong Kong, Shenzhen) - Machine Learning on Graphs, Graph Representation Learning
Junhao Dong (Nanyang Technological University)
Yanshu Li (Brown University) - NLP, Multimodal Learning
Yikai Li (Jiangsu University)
Linshan Li (Jiangsu University)
Haoyan Xu (University of Southern California) - Machine Learning
Yijiang Li (Argonne National Laboratory)
Zhikang Dong (Stony Brook University)
Huacan Wang (University of the Chinese Academy of Sciences)
Jifeng Shen (Jiangsu University) - Computer Vision