🤖 AI Summary
This study identifies the visual encoder's prior knowledge as a critical bottleneck limiting multimodal large language model (MLLM) performance, particularly in understanding low-prior visual entities (e.g., rare objects). To address this, the authors propose $Rank_e$, a novel metric quantifying prior strength, and empirically establish for the first time its positive correlation with model performance. They then introduce VisPRE, a two-stage training framework: Stage I explicitly injects multi-source visual priors via knowledge distillation and cross-modal alignment; Stage II jointly optimizes the visual encoder and language model. Unlike end-to-end VQA fine-tuning, VisPRE decouples prior acquisition from multimodal reasoning and strengthens each separately. It achieves state-of-the-art results across multiple VQA and fine-grained recognition benchmarks, improving rare-object question-answering accuracy by up to 12.7%.
📝 Abstract
Does the prior knowledge of the vision encoder constrain the capability boundary of Multi-modal Large Language Models (MLLMs)? While most existing research treats MLLMs as unified systems optimized through end-to-end training, the impact of the vision encoder's prior knowledge is seldom investigated. In this work, we introduce a novel metric, $Rank_e$, to quantify the effect of the vision encoder's prior knowledge on MLLM performance. Our analysis reveals a positive correlation between prior knowledge and MLLM performance. Moreover, we find that domain-specific fine-tuning using solely end-to-end visual question answering (VQA) data is insufficient, particularly for entities with low inherent visual prior knowledge. To address this issue, we propose VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. Experimental results demonstrate that augmenting the vision encoder's prior knowledge substantially boosts the visual understanding capabilities of MLLMs, offering a novel and effective strategy for improving performance, especially in scenarios involving uncommon visual entities.
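The abstract does not give the formula for $Rank_e$, but a rank-based measure of an encoder's prior over an entity can be sketched as follows. This is purely an illustrative assumption, not the paper's definition: it ranks the ground-truth entity's text embedding among a candidate pool by cosine similarity to the image embedding, so a lower rank would indicate a stronger visual prior. The function name `rank_e` and the random embeddings are hypothetical stand-ins for real encoder outputs.

```python
import numpy as np

def rank_e(image_emb, candidate_embs, gt_index):
    """Hypothetical sketch of a Rank_e-style metric (not the paper's
    definition): 1-based rank of the ground-truth entity when candidate
    embeddings are sorted by cosine similarity to the image embedding.
    A lower rank suggests stronger prior knowledge of that entity."""
    # Normalize so dot products equal cosine similarities.
    img = image_emb / np.linalg.norm(image_emb)
    cands = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = cands @ img
    # Sort candidates by similarity (descending) and locate the ground truth.
    order = np.argsort(-sims)
    return int(np.where(order == gt_index)[0][0]) + 1

# Toy demo with random stand-in embeddings (no real encoder involved).
rng = np.random.default_rng(0)
cands = rng.normal(size=(100, 64))
# Make the image embedding nearly identical to candidate 7.
image = cands[7] + 0.01 * rng.normal(size=64)
print(rank_e(image, cands, gt_index=7))  # prints 1
```

Under this reading, the paper's reported positive correlation would mean that entities ranked higher (closer to 1) by the encoder alone tend to be answered more accurately by the full MLLM.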