Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) suffer from limited parameter scale and weak self-correction capability, leading to suboptimal performance in long visual context modeling and complex reasoning—particularly in document understanding and visual question answering. To address this, we propose MACT, a multi-agent collaborative framework that decouples roles into planning, execution, judgment, and answering. Crucially, MACT introduces, for the first time, an independent judgment agent dedicated to result verification and iterative refinement. It further integrates hybrid reward modeling with agent-specific test-time scaling strategies to balance collaborative efficiency and individual agent capability. Evaluated on 15 benchmarks, MACT’s three variants achieve the top three average scores, outperforming prior methods on 13 tasks. Notably, it significantly advances long-context comprehension and complex reasoning while maintaining strong generalization and mathematical reasoning capabilities—even at reduced parameter scales.

📝 Abstract
Existing vision-language models (VLMs), whether generalists or specialists, remain constrained by their parameter scale, lack robust self-correction capabilities, and underperform in tasks involving long visual contexts and complex reasoning, resulting in suboptimal performance on document-based tasks. To address this, we propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling, tailored for visual document understanding and visual question answering (VQA). It comprises four distinct small-scale agents, i.e., planning, execution, judgment, and answer agents, with clearly defined roles and effective collaboration. Notably, the judgment agent exclusively verifies correctness and redirects results to prior agents for revision, outperforming conventional correction strategies. To further expand the capability boundaries of the framework, we propose mixed reward modeling, which balances agent-specific abilities and global collaboration, as well as agent-wise hybrid test-time scaling, which customizes a different scaling strategy for each agent based on its function. Evaluated on benchmarks spanning both document-based and non-document-based settings, MACT shows superior performance at a smaller parameter scale without sacrificing performance on general and mathematical tasks. In particular, it stands out on benchmarks involving long visual contexts and complicated reasoning. The three variants of MACT consistently hold the top three positions in average score, leading on 13 of the 15 benchmarks. Code will be available at: https://github.com/YU-deep/MACT.git.
Problem

Research questions and friction points this paper is trying to address.

Overcome limitations of vision-language models in document tasks
Enhance complex reasoning in long visual contexts
Improve self-correction and collaboration in multi-agent frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework with four specialized agents
Mixed reward modeling balances individual and global performance
Agent-wise hybrid test-time scaling customizes strategies
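The four-agent collaboration described above (plan → execute → judge → answer, with the judgment agent only verifying results and redirecting to earlier agents rather than correcting them itself) can be sketched as a control loop. This is a minimal illustration, not the paper's implementation: the agent functions, the `Judgment` structure, and the `max_rounds` cutoff are all hypothetical stand-ins for the learned small-scale VLM agents.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    correct: bool
    redirect_to: str = ""  # "planning" or "execution" when incorrect (assumed convention)

def run_pipeline(question, plan_fn, execute_fn, judge_fn, answer_fn, max_rounds=3):
    """Plan -> execute -> judge loop. The judgment agent only verifies
    correctness and routes back to a prior agent; it never edits results."""
    plan = plan_fn(question)
    result = None
    for _ in range(max_rounds):
        result = execute_fn(question, plan)
        verdict = judge_fn(question, plan, result)
        if verdict.correct:
            return answer_fn(question, result)
        if verdict.redirect_to == "planning":
            plan = plan_fn(question)  # re-plan, then re-execute
        # otherwise: retry execution with the same plan
    return answer_fn(question, result)  # best effort after max rounds

# Toy stand-in agents for demonstration only.
def plan_fn(q): return f"locate the answer to: {q}"
def execute_fn(q, plan): return "42"
def judge_fn(q, plan, result): return Judgment(correct=(result == "42"))
def answer_fn(q, result): return result

print(run_pipeline("What value appears on page 7?",
                   plan_fn, execute_fn, judge_fn, answer_fn))
# prints "42"
```

Keeping verification separate from revision means the judge's only outputs are a verdict and a routing decision, which is the property the abstract credits for outperforming conventional self-correction strategies.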