JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-image (T2I) models face a bottleneck in text–visual token fusion, struggling to simultaneously achieve precise conditional control and backbone generality. To address this, we propose JEPA-T: a unified multimodal framework built upon the Joint Embedding Predictive Architecture (JEPA), which enhances text guidance via cross-attention and enforces hierarchical alignment between raw text embeddings and discrete visual features, enabling both class-conditional and open-vocabulary (free-text) image generation within a single network. Technically, JEPA-T integrates discrete multimodal tokenization, a JEPA-Transformer backbone, conditional denoising, and a flow-matching loss. On ImageNet-1K, JEPA-T achieves significant improvements in data efficiency and open-vocabulary generalization, consistently outperforming both non-fusion and post-fusion baselines across all metrics.
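To picture the late-fusion idea described above, the sketch below shows one way cross-attention over raw text embeddings could be attached after a feature predictor while keeping the backbone task-agnostic. This is a minimal PyTorch sketch under assumed shapes and module names (CrossAttentionFusion, dim, num_heads are illustrative), not the authors' released implementation.

```python
# Minimal sketch (assumption: PyTorch; shapes, names, and dimensions are
# illustrative, not the paper's actual code).
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuses predicted visual-token features with raw text embeddings via
    cross-attention, leaving the predictor backbone untouched."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, N_vis, dim) output of the JEPA feature predictor
        # text_embeds:  (B, N_txt, dim) raw caption embeddings
        fused, _ = self.attn(query=visual_feats, key=text_embeds, value=text_embeds)
        # Residual connection keeps the backbone's features intact while adding text guidance.
        return self.norm(visual_feats + fused)
```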

📝 Abstract
Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures that are trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose JEPA-T, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while maintaining a task-agnostic backbone. Additionally, raw text embeddings are injected prior to the flow-matching loss to improve alignment during training. During inference, the same network performs both class-conditional and free-text image generation by iteratively denoising visual tokens conditioned on text. Evaluations on ImageNet-1K demonstrate that JEPA-T achieves strong data efficiency and open-vocabulary generalization, and consistently outperforms non-fusion and late-fusion baselines. Our approach shows that late architectural fusion combined with objective-level alignment offers an effective balance between conditioning strength and backbone generality in token-based T2I. The code is now available: https://github.com/justin-herry/JEPA-T.git
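The abstract's inference procedure (iteratively denoising visual tokens conditioned on text) can be pictured as Euler integration of a learned flow. The following is a hedged sketch assuming a velocity-prediction (flow-matching) sampler; the names sample, model, and all shapes are hypothetical and do not reflect the released repository.

```python
# Hedged sketch of text-conditioned iterative denoising (assumption:
# flow-matching / velocity-prediction formulation; names are illustrative).
import torch

@torch.no_grad()
def sample(model, text_embeds, num_tokens: int = 256, dim: int = 768, steps: int = 50):
    """model(x, t, text_embeds) -> predicted velocity over visual-token features."""
    B = text_embeds.shape[0]
    x = torch.randn(B, num_tokens, dim, device=text_embeds.device)  # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((B,), i * dt, device=x.device)
        v = model(x, t, text_embeds)   # predictor + cross-attention fusion inside `model`
        x = x + dt * v                  # Euler step along the learned flow
    # Denoised visual-token features; a decoder maps them back to image tokens / pixels.
    return x
```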
Problem

Research questions and friction points this paper is trying to address.

Enhancing text-visual token fusion in image generation
Improving alignment between text and visual representations
Balancing conditioning strength with model generality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint-embedding predictive Transformer for multimodal fusion
Cross-attention mechanism for conditional denoising
Text embedding injection for improved alignment (see the training-objective sketch after this list)
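To make the objective-level alignment listed above concrete, here is an illustrative flow-matching training step in which text embeddings enter the model call before the loss is computed. The rectified-flow interpolation and the names flow_matching_loss and model are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative flow-matching objective with text conditioning (assumption:
# rectified-flow style interpolation; not the released code's API).
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, text_embeds):
    """x1: clean visual-token features (B, N, D); text_embeds: caption embeddings."""
    x0 = torch.randn_like(x1)                           # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device) # random time per sample
    xt = (1 - t) * x0 + t * x1                          # linear interpolation between noise and data
    target_v = x1 - x0                                  # ground-truth velocity along the path
    pred_v = model(xt, t.squeeze(), text_embeds)        # text injected before the loss is computed
    return F.mse_loss(pred_v, target_v)
```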
Authors

Siheng Wan (Jiangsu University)
Zhengtao Yao (University of Southern California)
Zhengdao Li (The Chinese University of Hong Kong, Shenzhen) - Machine Learning on Graphs, Graph Representation Learning
Junhao Dong (Nanyang Technological University)
Yanshu Li (Brown University) - NLP, Multimodal Learning
Yikai Li (Jiangsu University)
Linshan Li (Jiangsu University)
Haoyan Xu (University of Southern California) - Machine Learning
Yijiang Li (Argonne National Laboratory)
Zhikang Dong (Stony Brook University)
Huacan Wang (University of the Chinese Academy of Sciences)
Jifeng Shen (Jiangsu University) - Computer Vision