ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of current vision-language models in complex document visual question answering (DocVQA), where multi-step reasoning and specialized handling of heterogeneous document elements remain challenging. To overcome these issues, the authors propose a multi-agent collaborative framework tailored for DocVQA. The approach decomposes questions via a reasoning agent, dynamically routes subtasks to modality-specific agents through an adaptive scheduling mechanism, and enhances answer reliability through a proposition–counterproposition debate followed by arbitration. Additionally, a format consistency checker ensures standardized output. Evaluated on three DocVQA benchmarks, the method significantly outperforms state-of-the-art models, demonstrating the effectiveness and scalability of the collaborative agent architecture in fine-grained document understanding and multi-step reasoning.

📝 Abstract
Document Visual Question Answering (DocVQA) remains challenging for existing Vision-Language Models (VLMs), especially under complex reasoning and multi-step workflows. Current approaches struggle to decompose intricate questions into manageable sub-tasks and often fail to leverage specialized processing paths for different document elements. We present ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering, a novel multi-agent framework that addresses these limitations through strategic agent coordination and iterative refinement. ORCA begins with a reasoning agent that decomposes queries into logical steps, followed by a routing mechanism that activates task-specific agents from a specialized agent dock. Our framework leverages a set of specialized AI agents, each dedicated to a distinct modality, enabling fine-grained understanding and collaborative reasoning across diverse document components. To ensure answer reliability, ORCA employs a debate mechanism with stress-testing, and when necessary, a thesis-antithesis adjudication process. This is followed by a sanity checker to ensure format consistency. Extensive experiments on three benchmarks demonstrate that our approach achieves significant improvements over state-of-the-art methods, establishing a new paradigm for collaborative agent systems in vision-language reasoning.
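The staged pipeline described in the abstract (query decomposition, agent routing, debate with arbitration, format checking) can be sketched as a minimal orchestration loop. All function names, routing rules, and agent labels below are illustrative assumptions for demonstration, not the authors' implementation; a real system would back each stage with an LLM or vision model.

```python
# Toy sketch of an ORCA-style orchestration pipeline (assumed structure,
# not the paper's actual code). Each stage is a stand-in for a model call.

def decompose(question: str) -> list[str]:
    # Reasoning agent: split a compound question into ordered sub-tasks.
    # (A real system would use an LLM; splitting on " and " is a stand-in.)
    return [part.strip() for part in question.split(" and ")]

def route(subtask: str) -> str:
    # Adaptive scheduler: assign each sub-task to a modality-specific agent.
    # These keyword rules are purely illustrative.
    if "table" in subtask:
        return "table_agent"
    if "chart" in subtask or "figure" in subtask:
        return "chart_agent"
    return "text_agent"

def debate(proposition: str, counterproposition: str) -> str:
    # Proposition-counterproposition debate with arbitration: accept the
    # answer if both sides agree, otherwise flag it for adjudication.
    return proposition if proposition == counterproposition else "needs_arbitration"

def format_check(answer: str) -> str:
    # Sanity checker: enforce a standardized output format.
    return answer.strip().lower()

def orca_pipeline(question: str) -> str:
    subtasks = decompose(question)
    assignments = {task: route(task) for task in subtasks}
    draft = "; ".join(f"{agent}: {task}" for task, agent in assignments.items())
    # In this toy run both debaters return the same draft, so no arbitration.
    verified = debate(draft, draft)
    return format_check(verified)
```

A call like `orca_pipeline("Read the table and summarize the chart")` would route the two sub-tasks to the table and chart agents respectively before the debate and format-check stages run.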
Problem

Research questions and friction points this paper is trying to address.

Document Visual Question Answering
Vision-Language Models
Complex Reasoning
Multi-step Workflows
Task Decomposition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent Collaboration
Query Decomposition
Specialized Agent Routing
Debate-based Verification
Document Visual Question Answering