🤖 AI Summary
This study identifies the visual encoder's prior knowledge as a critical bottleneck limiting multimodal large language model (MLLM) performance, particularly in understanding low-prior visual entities (e.g., rare objects). To address this, the authors propose $Rank_e$, a novel metric quantifying prior strength, and empirically establish for the first time its positive correlation with model performance. They then introduce VisPRE, a two-stage training framework: Stage I explicitly injects multi-source visual priors via knowledge distillation and cross-modal alignment; Stage II jointly optimizes the visual encoder and language model. Unlike end-to-end VQA fine-tuning, VisPRE decouples prior acquisition from multimodal reasoning and strengthens each separately. It achieves state-of-the-art results across multiple VQA and fine-grained recognition benchmarks, improving rare-object question-answering accuracy by up to 12.7%.
📝 Abstract
Does the prior knowledge of the vision encoder constrain the capability boundary of Multi-modal Large Language Models (MLLMs)? While most existing research treats MLLMs as unified systems optimized through end-to-end training, the impact of the vision encoder's prior knowledge is seldom investigated. In this work, we introduce a novel metric, $Rank_e$, to quantify the effect of the vision encoder's prior knowledge on MLLM performance. Our analysis reveals a positive correlation between prior knowledge and MLLM performance. Moreover, we find that domain-specific fine-tuning using solely end-to-end visual question answering (VQA) data is insufficient, particularly for entities with low inherent visual prior knowledge. To address this issue, we propose VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. Experimental results demonstrate that augmenting the vision encoder's prior knowledge substantially boosts the visual understanding capabilities of MLLMs, offering a novel and effective strategy for improving performance, especially in scenarios involving uncommon visual entities.
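The abstract does not give the formula for $Rank_e$, but a rank-based measure of an encoder's prior over an entity can be sketched as follows. This is purely an illustrative assumption, not the paper's definition: it ranks the ground-truth entity's text embedding among a candidate pool by cosine similarity to the image embedding, so a lower rank would indicate a stronger visual prior. The function name `rank_e` and the random embeddings are hypothetical stand-ins for real encoder outputs.

```python
import numpy as np

def rank_e(image_emb, candidate_embs, gt_index):
    """Hypothetical sketch of a Rank_e-style metric (not the paper's
    definition): 1-based rank of the ground-truth entity when candidate
    embeddings are sorted by cosine similarity to the image embedding.
    A lower rank suggests stronger prior knowledge of that entity."""
    # Normalize so dot products equal cosine similarities.
    img = image_emb / np.linalg.norm(image_emb)
    cands = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = cands @ img
    # Sort candidates by similarity (descending) and locate the ground truth.
    order = np.argsort(-sims)
    return int(np.where(order == gt_index)[0][0]) + 1

# Toy demo with random stand-in embeddings (no real encoder involved).
rng = np.random.default_rng(0)
cands = rng.normal(size=(100, 64))
# Make the image embedding nearly identical to candidate 7.
image = cands[7] + 0.01 * rng.normal(size=64)
print(rank_e(image, cands, gt_index=7))  # prints 1
```

Under this reading, the paper's reported positive correlation would mean that entities ranked higher (closer to 1) by the encoder alone tend to be answered more accurately by the full MLLM.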