Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models

📅 2025-03-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study identifies the visual encoder’s prior knowledge as a critical bottleneck limiting multimodal large language model (MLLM) performance—particularly in understanding low-prior visual entities (e.g., rare objects). To address this, we propose Rankₑ, a novel metric quantifying prior strength, and empirically establish its positive correlation with model performance for the first time. We then introduce VisPRE, a two-stage training framework: Stage I explicitly injects multi-source visual priors via knowledge distillation and cross-modal alignment; Stage II jointly optimizes the visual encoder and language model. Unlike end-to-end VQA fine-tuning, VisPRE decouples and enhances prior acquisition and multimodal reasoning. It achieves state-of-the-art results across multiple VQA and fine-grained recognition benchmarks, improving rare-object question-answering accuracy by up to 12.7%.

Technology Category

Application Category

📝 Abstract
Does the prior knowledge of the vision encoder constrain the capability boundary of Multi-modal Large Language Models (MLLMs)? While most existing research treats MLLMs as unified systems optimized through end-to-end training, the impact of vision encoder's prior knowledge is seldom investigated. In this work, we introduce a novel metric, $Rank_e$, to quantify the effect of the vision encoder's prior knowledge on MLLM performance. Our analysis reveals a positive correlation between prior knowledge and MLLM performance. Moreover, we find that domain-specific fine-tuning using solely end-to-end visual question answering (VQA) data is insufficient--particularly for entities with low inherent visual prior knowledge. To address this issue, we propose VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. Experimental results demonstrate that augmenting vision encoder's prior knowledge substantially boosts the visual understanding capabilities of MLLMs, offering a novel and effective strategy for improving performance, especially in scenarios involving uncommon visual entities.
Problem

Research questions and friction points this paper is trying to address.

Investigates vision encoder's prior knowledge impact on MLLMs
Proposes VisPRE to enhance visual understanding in MLLMs
Addresses insufficient fine-tuning for low-visual-knowledge entities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces $Rank_e$ metric for vision encoder knowledge
Proposes VisPRE two-stage training framework
Augments vision encoder prior knowledge
🔎 Similar Papers
No similar papers found.
Qiao Liang
Qiao Liang
Unknown affiliation
artificial intelligence
Yanjiang Liu
Yanjiang Liu
UCAS
Ben He
Ben He
Professor, University of Chinese Academy of Sciences
Natural Language ProcessingInformation Retrieval
Yaojie Lu
Yaojie Lu
Institute of Software, Chinese Academy of Sciences
Information ExtractionLarge Language Models
H
Hongyu Lin
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
J
Jia Zheng
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
X
Xianpei Han
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
Le Sun
Le Sun
Institute of Software, CAS
information_retrievalnatural_language_processing
Y
Yingfei Sun
University of Chinese Academy of Sciences