VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the large parameter counts, high decoding overhead, and poor multi-task adaptability of prevailing vision-language models, this paper introduces VL-JEPA, a lightweight and efficient model built on the Joint Embedding Predictive Architecture (JEPA). Methodologically, VL-JEPA abandons autoregressive token generation and instead directly predicts continuous semantic embeddings of the target text, learning task-relevant semantics in an abstract embedding space. It pairs a shared visual encoder with a lightweight conditional text decoder, supporting selective decoding (a 2.85× reduction in decoding operations) and handling video classification, retrieval, generation, and discriminative VQA without architectural modification. Experiments show that VL-JEPA outperforms CLIP, SigLIP2, and Perception Encoder across 16 video understanding benchmarks; with only 1.6B parameters, it matches InstructBLIP and QwenVL on four VQA benchmarks while using 50% fewer trainable parameters than an equivalent token-space baseline.
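The embedding-prediction idea above can be sketched minimally: instead of a token-level cross-entropy, the model regresses the target caption's embedding as produced by a (frozen) text encoder. The function names and the plain cosine-regression loss below are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def embedding_prediction_loss(pred_embs, target_embs):
    """Mean (1 - cosine similarity) between predicted and target text
    embeddings. Training pulls each prediction toward the frozen text
    encoder's embedding of the target caption, with no token generation.
    (Hypothetical stand-in for the paper's actual training objective.)"""
    pred = l2_normalize(np.asarray(pred_embs, dtype=float))
    target = l2_normalize(np.asarray(target_embs, dtype=float))
    return float(np.mean(1.0 - np.sum(pred * target, axis=-1)))
```

A perfect prediction gives a loss of 0; an orthogonal one gives 1, so the loss directly measures semantic mismatch in embedding space rather than token-level disagreement.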

📝 Abstract
We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text. We show that VL-JEPA natively supports selective decoding, reducing the number of decoding operations by 2.85x while maintaining performance similar to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architectural modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE, and POPEv2, despite having only 1.6B parameters.
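The selective decoding described in the abstract can be illustrated with a toy criterion: invoke the text decoder only when the predicted embedding is ambiguous relative to a bank of candidate text embeddings. The top-1-vs-runner-up margin rule below is a hypothetical stand-in for the paper's actual selection mechanism.

```python
import numpy as np

def selective_decode(pred_embs, candidate_embs, margin_threshold=0.1):
    """Decide which predictions need the text decoder.
    Inputs are assumed L2-normalized, so dot products are cosine
    similarities. If the nearest candidate beats the runner-up by a
    clear margin, we accept it directly; otherwise we flag the sample
    for full text decoding. (Illustrative rule, not the paper's.)"""
    sims = np.asarray(pred_embs) @ np.asarray(candidate_embs).T  # (N, C)
    order = np.argsort(-sims, axis=1)                            # best first
    top1, top2 = order[:, 0], order[:, 1]
    rows = np.arange(sims.shape[0])
    margins = sims[rows, top1] - sims[rows, top2]
    needs_decoder = margins < margin_threshold
    return top1, needs_decoder
```

Under such a rule, only low-margin samples pay the decoding cost, which is the mechanism by which selective decoding can cut decoding operations (2.85x in the paper's experiments) without changing the architecture.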
Problem

Research questions and friction points this paper is trying to address.

Prevailing VLMs carry large parameter counts and heavy autoregressive decoding overhead
Token-by-token generation spends capacity on surface-level linguistic variability rather than task semantics
One model is hard to adapt across classification, retrieval, generation, and discriminative VQA without architectural changes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts continuous embeddings instead of tokens
Uses lightweight decoder only when needed
Achieves strong performance with fewer parameters