DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture

📅 2025-11-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current I-JEPA models predict masked latent embeddings uniformly and in parallel, with no ordering, and therefore lack the saliency-driven selective attention of human vision. To address this, we propose DSeq-JEPA, the first joint-embedding predictive architecture to incorporate discriminative sequential attention. Our method uses a Transformer-based saliency estimator to localize semantically critical regions and establishes a progressive “primary→secondary” prediction curriculum. By integrating autoregressive sequence modeling with contrastive learning, DSeq-JEPA performs ordered, multi-stage inference over masked latent embeddings. Experiments show that DSeq-JEPA consistently outperforms I-JEPA variants on downstream tasks, including image classification, fine-grained recognition, object detection and segmentation, and visual reasoning, while improving discriminative semantic representation learning and cross-task generalization.

📝 Abstract

Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns visual representations by predicting latent embeddings of masked regions from visible context. However, it treats all regions uniformly and independently, lacking an explicit notion of where or in what order predictions should be made. Inspired by human visual perception, which deploys attention selectively and sequentially from the most informative to secondary regions, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges predictive and autoregressive self-supervised learning, integrating JEPA-style latent prediction with GPT-style sequential reasoning. Specifically, DSeq-JEPA (i) first identifies primary discriminative regions based on a transformer-derived saliency map, emphasizing the distribution of visual importance, and then (ii) predicts subsequent regions in this discriminative order, progressively forming a curriculum-like semantic progression from primary to secondary cues, a form of GPT-style pre-training. Extensive experiments across diverse tasks, including image classification (ImageNet), fine-grained visual categorization (iNaturalist21, CUB-200-2011, Stanford-Cars), detection and segmentation (MS-COCO, ADE20K), and low-level reasoning tasks (Clevr/Count, Clevr/Dist), demonstrate that DSeq-JEPA consistently focuses on more discriminative and generalizable representations than I-JEPA variants. Project page: https://github.com/SkyShunsuke/DSeq-JEPA.

Problem

Research questions and friction points this paper is trying to address.

I-JEPA predicts all masked regions in parallel, with no sequential structure to the prediction

I-JEPA treats image regions uniformly and independently, with no notion of where or in what order to predict

Joint-embedding prediction and GPT-style sequential reasoning have not been combined for visual representation learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a transformer-derived saliency map to identify the primary discriminative regions

Predicts regions sequentially from most to least discriminative

Integrates JEPA latent prediction with GPT-style reasoning
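
The ordering idea in the points above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the fixed-size primary set, and the equal-sized stages are all assumptions made for illustration, and the saliency scores are simply given as input rather than derived from a transformer. The sketch only shows how patch indices might be ranked by saliency and then unrolled into ordered (context, target) steps, with each predicted stage joining the context for the next one.

```python
from typing import Iterator, List, Sequence, Tuple


def order_regions_by_saliency(
    saliency: Sequence[float], num_primary: int, num_stages: int
) -> Tuple[List[int], List[List[int]]]:
    """Rank patch indices by saliency (hypothetical helper).

    Returns the top `num_primary` indices as the primary region and the
    remaining indices split into at most `num_stages` ordered groups.
    """
    # Python's sort is stable, so ties keep their original index order.
    ranked = sorted(range(len(saliency)), key=lambda i: -saliency[i])
    primary, rest = ranked[:num_primary], ranked[num_primary:]

    # Assumption: stages are contiguous chunks of near-equal size.
    stages: List[List[int]] = []
    base, extra = divmod(len(rest), num_stages)
    start = 0
    for k in range(num_stages):
        size = base + (1 if k < extra else 0)
        if size:
            stages.append(rest[start:start + size])
        start += size
    return primary, stages


def sequential_schedule(
    primary: List[int], stages: List[List[int]]
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (context, target) index pairs in discriminative order."""
    context = list(primary)
    for target in stages:
        yield list(context), list(target)
        # Regions predicted at this step become context for the next step.
        context = context + target


# Toy saliency scores for a 2x2 patch grid, flattened row by row.
scores = [0.1, 0.9, 0.5, 0.3]
primary, stages = order_regions_by_saliency(scores, num_primary=1, num_stages=2)
schedule = list(sequential_schedule(primary, stages))
```

In a real training loop each (context, target) pair would drive one latent-prediction step; here the schedule is just a list of index pairs, so the ordering logic can be inspected on its own.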
