Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training

📅 2025-09-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates how large language models (LLMs) acquire visual priors through text-only pre-training. Method: The authors decompose these visual priors into separable perception and reasoning priors and run over 100 controlled experiments, spanning five model scales and a wide range of data categories and mixtures, to isolate their origins and scaling behavior. Contribution/Results: The reasoning prior develops predominantly from reasoning-centric data (e.g., code, math, academic text), scales progressively, and transfers to visual reasoning. The perception prior emerges more diffusely from broad corpora and is more sensitive to the vision encoder and visual instruction tuning data; text describing the visual world is crucial, but its benefit saturates rapidly. Based on these findings, the authors propose a data-centric recipe for pre-training vision-aware LLMs, verify it at 1T-token scale, and introduce the Multi-Level Existence Bench (MLE-Bench). The experiments cover the full MLLM construction pipeline, from LLM pre-training through visual alignment to supervised multimodal fine-tuning.

📝 Abstract

Large Language Models (LLMs), despite being trained on text alone, surprisingly develop rich visual priors. These priors allow latent visual capabilities to be unlocked for vision tasks with a relatively small amount of multimodal data, and in some cases, to perform visual tasks without ever having seen an image. Through systematic analysis, we reveal that visual priors (the implicit, emergent knowledge about the visual world acquired during language pre-training) are composed of separable perception and reasoning priors with unique scaling trends and origins. We show that an LLM's latent visual reasoning ability is predominantly developed by pre-training on reasoning-centric data (e.g., code, math, academia) and scales progressively. This reasoning prior acquired from language pre-training is transferable and universally applicable to visual reasoning. In contrast, a perception prior emerges more diffusely from broad corpora, and perception ability is more sensitive to the vision encoder and visual instruction tuning data. In parallel, text describing the visual world proves crucial, though its performance impact saturates rapidly. Leveraging these insights, we propose a data-centric recipe for pre-training vision-aware LLMs and verify it in 1T-token-scale pre-training. Our findings are grounded in over 100 controlled experiments consuming 500,000 GPU-hours, spanning the full MLLM construction pipeline (from LLM pre-training to visual alignment and supervised multimodal fine-tuning) across five model scales, a wide range of data categories and mixtures, and multiple adaptation setups. Along with our main findings, we propose and investigate several hypotheses, and introduce the Multi-Level Existence Bench (MLE-Bench). Together, this work provides a new way of deliberately cultivating visual priors from language pre-training, paving the way for the next generation of multimodal LLMs.

Problem

Research questions and friction points this paper is trying to address.

Understanding how LLMs develop visual knowledge from text-only training

Analyzing separable perception and reasoning priors in language models

Developing methods to cultivate visual capabilities from language pre-training

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs develop visual priors from text-only training

Separable perception and reasoning priors with scaling trends

Data-centric recipe for pre-training vision-aware LLMs
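
As a rough illustration of what a data-centric recipe can look like in practice, the Python sketch below splits a pre-training token budget across data categories in proportion to mixture weights. The function name `allocate_tokens`, the category names, and the weights are all hypothetical placeholders chosen for illustration; they are not the ratios or the procedure reported in the paper.

```python
def allocate_tokens(total_tokens: int, weights: dict[str, float]) -> dict[str, int]:
    """Split a token budget across data categories in proportion to weights."""
    if not weights:
        raise ValueError("weights must not be empty")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")

    # Proportional share per category, truncated to whole tokens.
    allocation = {
        name: int(total_tokens * w / total_weight) for name, w in weights.items()
    }

    # Assign any rounding leftover to the largest-weight category so the
    # shares always sum to exactly total_tokens.
    leftover = total_tokens - sum(allocation.values())
    largest = max(weights, key=weights.get)
    allocation[largest] += leftover
    return allocation


# Hypothetical mixture: NOT the paper's ratios, only placeholder values.
HYPOTHETICAL_MIX = {
    "reasoning_centric": 2.0,   # e.g., code, math, academic text
    "visual_world_text": 1.0,   # text describing the visual world
    "broad_web": 1.0,           # general corpora
}

if __name__ == "__main__":
    for category, tokens in allocate_tokens(1_000_000, HYPOTHETICAL_MIX).items():
        print(f"{category}: {tokens} tokens")
```

Varying the weights while holding the total budget fixed is one simple way to set up the kind of controlled mixture comparison the abstract describes.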

🔎 Similar Papers

Better Language Models Exhibit Higher Visual Alignment