Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training

📅 2025-05-13
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Standard large vision-language model (LVLM) pretraining relies on next-token prediction (NTP), which is vulnerable to image-irrelevant textual noise and thus prone to hallucination. To address this, we propose PRIOR, a novel importance-sampling framework for multimodal pretraining. PRIOR leverages a text-only large language model to automatically estimate the relevance of each caption token to the input image and applies token-level dynamic weighting to the NTP loss, thereby prioritizing the learning of image-relevant tokens. Crucially, PRIOR requires no human annotation or architectural modification and is compatible with diverse LVLMs, both with and without visual encoders. Empirical evaluation across multiple vision-language benchmarks demonstrates average relative performance gains of 19% (for LVLMs with visual encoders) and 8% (for encoder-free variants). Moreover, PRIOR exhibits superior scalability in both computational efficiency and data utilization.

๐Ÿ“ Abstract
In standard large vision-language model (LVLM) pre-training, the model typically maximizes the joint probability of the caption conditioned on the image via next-token prediction (NTP); however, since only a small subset of caption tokens directly relates to the visual content, this naive NTP unintentionally fits the model to noise and increases the risk of hallucination. We present PRIOR, a simple vision-language pre-training approach that addresses this issue by prioritizing image-related tokens through differential weighting in the NTP loss, drawing on the importance sampling framework. PRIOR introduces a reference model, a text-only large language model (LLM) trained on the captions without image inputs, to weight each token for LVLM training based on its probability under that reference. Intuitively, tokens that are directly related to the visual inputs are harder to predict without the image and thus receive lower probabilities from the text-only reference LLM. During training, we apply a token-specific re-weighting term based on these importance scores to adjust each token's loss. We implement PRIOR in two distinct settings, LVLMs with visual encoders and LVLMs without visual encoders, and observe 19% and 8% average relative improvements, respectively, on several vision-language benchmarks compared to NTP. In addition, PRIOR exhibits superior scaling properties, as demonstrated by significantly higher scaling coefficients, indicating greater potential for performance gains than NTP as compute and data grow.
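The reweighting described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact weighting function PRIOR derives from importance sampling may differ, and the `(1 - p_ref) ** alpha` form, the `alpha` parameter, and the per-sequence normalization here are assumptions chosen to convey the intuition that tokens the text-only reference model finds hard (low probability without the image) are presumed image-related and up-weighted in the NTP loss.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prior_weighted_ntp_loss(lvlm_logits, ref_logits, target_ids, alpha=1.0):
    """Token-reweighted NTP loss in the spirit of PRIOR (illustrative sketch).

    lvlm_logits: (B, T, V) logits from the LVLM, conditioned on the image.
    ref_logits:  (B, T, V) logits from a text-only reference LLM (no image).
    target_ids:  (B, T) caption token ids.
    alpha:       hypothetical sharpness parameter for the weighting.
    """
    B, T, V = lvlm_logits.shape
    ib, it = np.indices((B, T))

    # Per-token cross-entropy under the LVLM (standard NTP loss, unreduced).
    ntp = -np.log(softmax(lvlm_logits)[ib, it, target_ids] + 1e-12)

    # Reference probability of each caption token without seeing the image.
    p_ref = softmax(ref_logits)[ib, it, target_ids]

    # Low p_ref => hard to predict from text alone => presumed image-related,
    # so the token gets a larger weight; normalize per sequence so the loss
    # scale stays comparable to plain NTP.
    weights = (1.0 - p_ref) ** alpha
    weights = weights / np.maximum(weights.mean(axis=1, keepdims=True), 1e-8)

    return float((weights * ntp).mean())
```

With a uniform reference model (all tokens equally predictable without the image), the weights collapse to 1 and the loss reduces to plain NTP, which makes the baseline correspondence easy to check.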
Problem

Research questions and friction points this paper is trying to address.

Standard LVLM pre-training fits noise under naive NTP
PRIOR prioritizes image-related tokens via differential weighting of the NTP loss
PRIOR yields 19% and 8% average relative improvements on vision-language benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prioritizes image-related tokens via differential weighting
Uses a text-only reference LLM to weight tokens for LVLM training
Implements token-specific re-weighting based on importance scores