Next-Embedding Prediction Makes Strong Vision Learners

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the feasibility of generative pre-training for visual self-supervised learning. We propose Next-Embedding Predictive Autoregression (NEPA), a paradigm that abandons pixel reconstruction, discrete tokenization, and contrastive losses, instead performing causal autoregressive prediction directly over continuous patch embeddings in latent space. NEPA employs a Vision Transformer (ViT) backbone with causal masking and gradient truncation, enabling end-to-end, head-free modeling while supporting cross-modal extensibility. On ImageNet-1K fine-tuning, NEPA achieves 83.8% and 85.3% top-1 accuracy with ViT-Base and ViT-Large, respectively. In transfer learning to ADE20K semantic segmentation, it significantly outperforms leading self-supervised methods. Our key contribution is the first direct application of generative pre-training to continuous visual embedding prediction—establishing a novel, unified representation learning paradigm that bridges language-inspired autoregression and vision foundation models.

📝 Abstract
Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings that perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, a scheme we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1K with next-embedding prediction as its sole learning objective is effective: no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads are needed. This formulation retains architectural simplicity and scalability without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative for visual self-supervised learning.
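The causal masking the abstract borrows from language modeling restricts each patch to attend only to earlier patches in the sequence. A minimal numpy sketch of that mask (assuming the common rows-are-queries, columns-are-keys convention; the paper's exact implementation details are not given here):

```python
import numpy as np

def causal_mask(num_patches: int) -> np.ndarray:
    """Boolean attention mask for autoregressive patch modeling.

    Entry [t, s] is True iff patch t (query) may attend to patch s (key),
    i.e. only positions s <= t are visible -- a lower-triangular mask,
    exactly as in language-model pretraining.
    """
    return np.tril(np.ones((num_patches, num_patches), dtype=bool))
```

In an attention layer, positions where the mask is False would have their logits set to negative infinity before the softmax, so no information flows from future patches to past ones.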
Problem

Research questions and friction points this paper is trying to address.

Develops a generative pretraining method for vision models
Shifts from learning representations to learning predictive models
Explores embedding prediction as a self-supervised learning objective
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predict future patch embeddings using autoregression
Use causal masking and stop gradient techniques
Train Transformer with next-embedding prediction objective
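The objective described by the bullets above can be sketched as follows. This is an illustrative numpy sketch, not the paper's implementation: the choice of negative cosine similarity as the distance is an assumption (an MSE in embedding space would be equally plausible), and the stop gradient on targets is represented here simply by treating them as constants, since numpy has no autograd.

```python
import numpy as np

def nepa_loss(embeddings: np.ndarray, predictions: np.ndarray) -> float:
    """Next-embedding prediction loss over one patch sequence.

    embeddings:  (N, D) continuous patch embeddings z_1..z_N, used as
                 targets; in the paper these carry a stop gradient.
    predictions: (N, D) outputs of a causally masked Transformer, where
                 predictions[t] is the guess for z_{t+1} given z_1..z_t.
    """
    preds = predictions[:-1]   # predictions for positions 2..N
    targets = embeddings[1:]   # ground-truth next embeddings (constants)
    # Assumed distance: negative cosine similarity, averaged over positions.
    preds_n = preds / np.linalg.norm(preds, axis=-1, keepdims=True)
    targets_n = targets / np.linalg.norm(targets, axis=-1, keepdims=True)
    return float(-np.mean(np.sum(preds_n * targets_n, axis=-1)))
```

A model that perfectly predicts each next embedding attains the minimum loss of -1 under this cosine formulation; the last prediction (for a nonexistent patch N+1) is simply dropped.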