🤖 AI Summary
Multimodal large language models (MLLMs) show limited performance on visual reasoning tasks, largely because they learn visual understanding from subjective, inherently incomplete textual supervision, and because multimodal instruction tuning is far smaller in scale than text-only pre-training, leaving fine-grained visual details under-modeled. To address this, JARVIS integrates the I-JEPA self-supervised paradigm into MLLM vision–language alignment: it keeps the visual encoder (e.g., a ViT) frozen and repurposes the early layers of the LLM as a lightweight, trainable predictor that directly models structural and semantic regularities in images, enabling purely visual self-supervised enhancement. JARVIS yields consistent gains across multiple vision-centric benchmarks, remains compatible with diverse LLM backbones, and preserves strong vision–language reasoning capabilities. The implementation is publicly available.
📝 Abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLM training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.
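The I-JEPA-style objective described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the frozen vision foundation model is stood in for by a fixed random projection, the trainable predictor (the early LLM layers in JARVIS) by a single linear map, and all dimensions and the masking ratio are made-up assumptions. The core idea it shows is that, given context-patch representations from a frozen encoder, the predictor is trained to regress the same encoder's representations of masked target patches, with no text supervision involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the paper.
num_patches, dim = 16, 32

# Frozen vision foundation model, sketched as a fixed random projection.
# In JARVIS the same frozen encoder provides both context and target features.
W_enc = rng.standard_normal((dim, dim)) * 0.1

def frozen_encoder(patches):
    # patches: (num_patches, dim) raw patch features -> encoded representations
    return patches @ W_enc

# Trainable predictor, sketched as one linear layer standing in for the
# early LLM layers that JARVIS repurposes.
W_pred = rng.standard_normal((dim, dim)) * 0.1

def predictor(context_repr):
    return context_repr @ W_pred

# I-JEPA-style step: mask a subset of patches as targets, encode everything
# with the frozen encoder, zero the targets out of the context, and ask the
# predictor to recover the target-patch embeddings (scored with MSE).
patches = rng.standard_normal((num_patches, dim))
target_idx = rng.permutation(num_patches)[: num_patches // 4]

full_repr = frozen_encoder(patches)   # target representations (no gradient)
context = full_repr.copy()
context[target_idx] = 0.0             # hide target patches from the context

pred = predictor(context)[target_idx]
loss = np.mean((pred - full_repr[target_idx]) ** 2)
print(f"JEPA prediction loss: {loss:.4f}")
```

In the actual framework only the predictor's parameters (here `W_pred`) would receive gradients; the encoder stays frozen throughout, which is what lets the LLM's early layers absorb visual structure without disturbing the vision backbone.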