Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual reasoning over information-dense images, such as infographics and charts, is challenging because critical cues are hard to localize in dense layouts and the relevant evidence is dispersed. Method: This paper proposes Speculative Verdict (SV), a training-free framework in which several lightweight vision-language models first generate candidate reasoning paths, a consensus expert selection mechanism keeps only the high-agreement paths, and a strong verdict VLM then synthesizes them into the final answer. Contribution/Results: Inspired by speculative decoding, the framework targets both error correction and computational efficiency. On information-intensive and high-resolution benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K, it reports consistent gains and is more cost-efficient than large proprietary models or training pipelines.

📝 Abstract
Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict
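
The consensus expert selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how agreement is measured, so this example assumes a simple stand-in: a path's agreement score is the number of other draft paths whose normalized final answer matches its own. The names `DraftPath`, `normalize`, `select_consensus`, and the `top_k` parameter are hypothetical and are not taken from the authors' code.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DraftPath:
    expert: str     # hypothetical identifier of the draft expert
    reasoning: str  # free-text reasoning path produced by that expert
    answer: str     # final answer extracted from the reasoning path


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def select_consensus(paths: list[DraftPath], top_k: int = 3) -> list[DraftPath]:
    """Keep the top_k paths whose answers agree with the most other paths.

    Assumption: agreement is exact match on the normalized answer string.
    The paper's actual agreement measure may differ.
    """
    counts = Counter(normalize(p.answer) for p in paths)
    # Score = number of *other* paths sharing this answer. sorted() is stable,
    # so ties keep the original expert order.
    ranked = sorted(paths, key=lambda p: counts[normalize(p.answer)] - 1, reverse=True)
    return ranked[:top_k]
```

With five draft paths of which three answer "42%", `select_consensus(paths, top_k=3)` would forward only those three to the verdict stage, so the large model reads fewer and more mutually consistent paths.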

Problem

Research questions and friction points this paper is trying to address.

Reasoning over information-intensive images with dense layouts
Precisely localizing critical cues in complex visual data
Multi-hop reasoning to integrate dispersed visual evidence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines multiple lightweight draft experts with a large verdict model
Uses consensus expert selection to forward only high-agreement reasoning paths to the verdict model
Synthesizes multiple partially accurate reasoning paths to achieve both error correction and cost-efficiency
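
The three contributions above compose into a draft, select, verdict flow, which might be wired together roughly as follows. This is a sketch under stated assumptions rather than the authors' implementation: the models are passed in as plain callables, and the type aliases `Draft`, `Select`, and `Verdict`, the function `speculative_verdict`, and the fallback to all paths when selection returns nothing are all hypothetical.

```python
from typing import Callable

# Hypothetical interfaces; a real system would wrap actual VLM calls.
Draft = Callable[[str, str], str]               # (image_path, question) -> reasoning path
Select = Callable[[list[str]], list[str]]       # all paths -> high-agreement subset
Verdict = Callable[[str, str, list[str]], str]  # (image_path, question, paths) -> answer


def speculative_verdict(
    image_path: str,
    question: str,
    drafts: list[Draft],
    select: Select,
    verdict: Verdict,
) -> str:
    """Run small draft experts, filter their paths, then call the verdict model once."""
    # Draft stage: each lightweight expert proposes one reasoning path.
    paths = [draft(image_path, question) for draft in drafts]

    # Consensus selection: forward only the paths the selector keeps.
    # Assumption: if the selector keeps nothing, fall back to all paths.
    kept = select(paths) or paths

    # Verdict stage: a single call to the large model, which synthesizes the
    # kept paths into the final answer.
    return verdict(image_path, question, kept)
```

The large model is invoked only once per question in this arrangement, which is where the cost saving described in the abstract would come from; the draft experts carry the repeated work.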