Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning

📅 2026-03-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses hallucination and suboptimal reasoning in multi-vision-language-model (VLM) fusion with V3Fusion, a method that selects models dynamically by exploiting both the visual and linguistic modalities. V3Fusion introduces a visual verification mechanism that combines a CKA-based focal disagreement metric (CKA-focal) with focal error diversity to guide a genetic algorithm in selecting complementary models from a heterogeneous pool, enabling robust output fusion even when no consensus exists or the majority of models err. The approach explicitly models epistemic uncertainty and achieves state-of-the-art results, outperforming the best single model by 8.09% on MMMU and 4.87% on MMMU-Pro, and surpassing strong VLMs such as Intern-VL2-8B and Qwen2.5-VL-7B on the generative tasks A-OKVQA and OCR-VQA.
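The CKA-based visual disagreement described above can be sketched with linear Centered Kernel Alignment between the visual embeddings two VLMs produce for the same inputs. This is a minimal illustration only: the paper's CKA-focal metric is its own formulation, and `cka_disagreement` below is an assumed simplification (one minus CKA similarity), not the authors' code.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two embedding matrices.

    X, Y: (n_samples, dim) visual embeddings from two VLMs for the same
    inputs (the embedding dimensions may differ). Returns a value in
    [0, 1]; 1 means the representations are linearly identical.
    """
    # Center each feature matrix across samples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style numerator and Cauchy-Schwarz normalizer.
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den

def cka_disagreement(X, Y):
    # Assumed simplification: higher means the two VLMs' visual
    # representations diverge more, which the paper uses as a
    # complementarity signal for model selection.
    return 1.0 - linear_cka(X, Y)
```

Pairwise disagreement scores like this can then be aggregated over a candidate ensemble to score its visual diversity.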

📝 Abstract
With the growing number and diversity of Vision-Language Models (VLMs), many works explore language-based ensemble, collaboration, and routing techniques across multiple VLMs to improve multi-model reasoning. In contrast, we address diverse model selection using both the vision and language modalities. We introduce focal error diversity to capture complementary reasoning across VLMs and a CKA-based focal diversity metric (CKA-focal) to measure disagreement in their visual embeddings. On the ensemble surface constructed from a pool of candidate VLMs, we apply a Genetic Algorithm to prune out component VLMs that do not add value to fusion performance. We identify the best combination for each task, fuse the outputs of the VLMs in the model pool, and show that heterogeneous models can capture epistemic uncertainty dynamically and mitigate hallucinations. Our V3Fusion approach produces dual focal-diversity fused predictions with high performance for vision-language reasoning, even when there is no majority consensus or the majority of VLMs make incorrect predictions. Extensive experiments validate V3Fusion on four popular VLM benchmarks (A-OKVQA, MMMU, MMMU-Pro, and OCR-VQA). The results show that V3Fusion outperforms the best-performing VLM by 8.09% on MMMU and by 4.87% on MMMU-Pro in accuracy. For generative tasks, V3Fusion outperforms Intern-VL2-8b and Qwen2.5-VL-7b, the top-2 VLM performers, on both A-OKVQA and OCR-VQA. Our code and datasets are available at https://github.com/sftekin/v3fusion.
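The Genetic Algorithm pruning step in the abstract can be sketched as a toy search over binary keep/drop masks on the VLM pool. Everything here is an illustrative assumption rather than the authors' implementation: the `fitness` callback stands in for whatever objective they optimize on the ensemble surface (e.g. validation accuracy plus the two focal-diversity terms), and the operator choices (elitist selection, one-point crossover, bit-flip mutation) are generic GA defaults.

```python
import random

def ga_prune(pool_size, fitness, generations=50, pop_size=20, p_mut=0.1, seed=0):
    """Toy GA over binary masks: mask[i] == 1 keeps VLM i in the ensemble.

    `fitness(mask)` scores a candidate sub-ensemble; higher is better.
    """
    rng = random.Random(seed)
    # Random initial population of keep/drop masks.
    pop = [[rng.randint(0, 1) for _ in range(pool_size)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # elitism: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, pool_size)            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ int(rng.random() < p_mut)      # bit-flip mutation
                     for bit in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)  # best mask found
```

With a diversity-aware fitness, masks that drop redundant VLMs score as well as (or better than) the full pool, which is how pruning falls out of the search.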
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Model Fusion
Visual Reasoning
Ensemble Learning
Hallucination Mitigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM fusion
focal error diversity
CKA-focal
genetic algorithm pruning
vision-language reasoning