🤖 AI Summary
Existing methods for Knowledge-Based Visual Question Answering (KBVQA) rely on a single evidence source, which leads to superficial reasoning and limits robustness and interpretability. To address this, we propose Synergos-VQA, a novel framework that, for the first time, synergistically integrates three heterogeneous evidence streams: (i) holistic evidence from global scene awareness, (ii) structural evidence from prototype-driven identification of key objects, and (iii) causal evidence from counterfactual probing. A dynamic multi-source fusion mechanism and a plug-and-play module design allow the evidence streams to complement and cross-verify one another. Evaluated on three major benchmarks (OK-VQA, A-OKVQA, and VQAv2), Synergos-VQA achieves state-of-the-art performance and consistently improves question-answering accuracy and reasoning reliability across multiple open-source multimodal large language models (MLLMs), empirically demonstrating that multi-evidence synergy is more effective than mere model scaling.
📝 Abstract
Multimodal Large Language Models (MLLMs) have pushed the frontiers of Knowledge-Based Visual Question Answering (KBVQA), yet their reasoning remains fundamentally bottlenecked by reliance on uni-dimensional evidence. This "seeing only the trees, but not the forest" approach prevents robust, multi-faceted understanding. Inspired by the principle of seeing both the forest and the trees, we propose Synergos-VQA, a novel synergistic reasoning framework. At its core, Synergos-VQA concurrently generates and fuses three complementary evidence streams at inference time: (1) Holistic Evidence that perceives the entire scene (the "forest"), (2) Structural Evidence from a prototype-driven module that identifies key objects (the "trees"), and (3) Causal Evidence from a counterfactual probe that ensures the reasoning is robustly grounded. By synergistically fusing this multi-faceted evidence, our framework achieves a more comprehensive and reliable reasoning process. Extensive experiments show that Synergos-VQA decisively establishes a new state-of-the-art on three challenging benchmarks: OK-VQA, A-OKVQA, and VQAv2. Furthermore, our approach demonstrates strong plug-and-play capability, significantly boosting various open-source MLLMs and showing that superior methodological design can outperform sheer model scale.
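For intuition, here is a minimal sketch of how such an inference-time pipeline could be organized: three independent evidence generators feed a fusion step that builds a single evidence-augmented prompt for a frozen MLLM backbone. All names below (`Evidence`, `holistic_evidence`, `synergos_answer`, etc.) are illustrative placeholders rather than the paper's actual API, and the stubs stand in for the real captioner, prototype-driven module, counterfactual probe, and dynamic fusion mechanism, which are considerably more involved.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of a Synergos-VQA-style inference pipeline.
# None of these names come from the paper; they only illustrate the
# "generate three evidence streams, then fuse" control flow.

@dataclass
class Evidence:
    kind: str  # "holistic" | "structural" | "causal"
    text: str  # natural-language evidence handed to the MLLM


def holistic_evidence(image) -> Evidence:
    # Stub: a scene-level captioner would describe the whole image here.
    return Evidence("holistic", "a kitchen counter with fruit and a knife")


def structural_evidence(image) -> Evidence:
    # Stub: a prototype-driven detector would identify key objects here.
    return Evidence("structural", "banana, cutting board, chef's knife")


def causal_evidence(image, question: str) -> Evidence:
    # Stub: a counterfactual probe would check whether the tentative answer
    # survives when the putatively decisive image region is removed.
    return Evidence("causal", "the answer depends on the fruit, not the knife")


def synergos_answer(mllm: Callable[[str], str], image, question: str) -> str:
    """Fuse the three evidence streams into one prompt for a frozen MLLM."""
    streams = [
        holistic_evidence(image),
        structural_evidence(image),
        causal_evidence(image, question),
    ]
    evidence_block = "\n".join(f"[{e.kind}] {e.text}" for e in streams)
    prompt = f"Evidence:\n{evidence_block}\n\nQuestion: {question}\nAnswer:"
    return mllm(prompt)


if __name__ == "__main__":
    # Any text-in/text-out model wrapper works here, which is what makes
    # this design plug-and-play with respect to the MLLM backbone.
    fake_mllm = lambda prompt: "banana"  # trivial stand-in for a real model
    print(synergos_answer(fake_mllm, image=None, question="What fruit is shown?"))
```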