🤖 AI Summary
This work addresses the challenges of reasoning accuracy and efficiency in information-dense document question answering, where long contexts and information overload hinder existing end-to-end vision-language models (VLMs). The authors propose a functionally decoupled multi-model collaborative architecture that leverages a lightweight VLM to extract visual cues and convert them into textual form, which is then processed by a large language model for logical reasoning. A query-complexity-aware routing mechanism dynamically selects the optimal inference path, and the framework is further enhanced with retrieval-augmented generation (RAG). The approach achieves new state-of-the-art results on DocBench and MMLongBench while significantly reducing inference costs, with ablation studies confirming the effectiveness of each component.
📝 Abstract
Information-intensive Document Question Answering (DocQA) is often constrained by long contexts and information overload, which hinder Vision-Language Models (VLMs) from performing precise direct reasoning. Although multimodal GraphRAG has achieved preliminary breakthroughs, existing frameworks still face dual challenges: (1) the necessity of large-scale models for handling queries of diverse complexities and (2) the inherent reasoning bottlenecks of end-to-end VLMs. To address these issues, we propose AutoThinkRAG, a framework that enhances the understanding of complex documents by synergizing the capabilities of multiple models. Specifically, we introduce a Query Complexity Router that allocates reasoning paths based on an analysis of query difficulty. Furthermore, to overcome the reasoning boundaries of VLMs, we propose a functionally decoupled architecture: a small-scale VLM serves as a high-fidelity visual interpreter that transforms query-relevant visual cues into textual representations, which are subsequently processed by an LLM for logical deduction and synthesis. Extensive experiments on DocBench and MMLongBench demonstrate that AutoThinkRAG significantly reduces inference costs while achieving new state-of-the-art performance. Further ablation studies verify the effectiveness of each proposed component.
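The abstract describes two mechanisms: a router that picks an inference path by query complexity, and a decoupled path in which a small VLM textualizes visual cues for an LLM to reason over. A minimal Python sketch of that control flow is below; all class and function names (`classify_complexity`, `small_vlm_describe`, `llm_reason`) are illustrative assumptions, not the authors' actual API, and the model calls are stubbed.

```python
def classify_complexity(query: str) -> str:
    """Toy stand-in for the Query Complexity Router.

    A real router would use a learned classifier over the query (and
    possibly retrieved context); here we key off a few surface markers
    purely for illustration.
    """
    hard_markers = ("compare", "trend", "why", "across", "difference")
    return "complex" if any(m in query.lower() for m in hard_markers) else "simple"


def small_vlm_describe(query: str, page_image: bytes) -> str:
    """Placeholder for the small VLM acting as a visual interpreter:
    it would return query-relevant visual cues as text."""
    return f"[textual cues relevant to: {query}]"


def llm_reason(query: str, cues: str) -> str:
    """Placeholder for the LLM performing logical deduction over the
    textualized cues."""
    return f"answer to {query!r} derived from {cues}"


def answer(query: str, page_image: bytes) -> str:
    """Route a query: simple queries take the cheap direct-VLM path,
    complex queries take the decoupled VLM -> text -> LLM path."""
    if classify_complexity(query) == "simple":
        return small_vlm_describe(query, page_image)
    cues = small_vlm_describe(query, page_image)
    return llm_reason(query, cues)
```

The design choice the sketch highlights is that the expensive LLM is only invoked on the complex path, which is how the framework can cut average inference cost without sacrificing accuracy on hard queries.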