🤖 AI Summary
This work addresses the challenges of reasoning accuracy and efficiency in information-dense document question answering, where long contexts and information overload hinder existing end-to-end vision-language models (VLMs). The authors propose a functionally decoupled multi-model collaborative architecture that leverages a lightweight VLM to extract visual cues and convert them into textual form, which is then processed by a large language model for logical reasoning. A query-complexity-aware routing mechanism dynamically selects the optimal inference path, and the framework is further enhanced with retrieval-augmented generation (RAG). The approach achieves new state-of-the-art results on DocBench and MMLongBench while significantly reducing inference costs, with ablation studies confirming the effectiveness of each component.
📝 Abstract
Information-intensive Document Question Answering (DocQA) is often constrained by long contexts and information overload, which hinder Vision-Language Models (VLMs) from performing precise direct reasoning. Although multimodal GraphRAG has achieved preliminary breakthroughs, existing frameworks still face dual challenges: (1) the necessity of large-scale models for handling queries of diverse complexities and (2) the inherent reasoning bottlenecks of end-to-end VLMs. To address these issues, we propose AutoThinkRAG, a framework that enhances the understanding of complex documents by synergizing the capabilities of multiple models. Specifically, we introduce a Query Complexity Router that allocates reasoning paths based on an analysis of query difficulty. Furthermore, to overcome the reasoning boundaries of VLMs, we propose a functionally decoupled architecture: a small-scale VLM serves as a high-fidelity visual interpreter that transforms query-relevant visual cues into textual representations, which are subsequently processed by an LLM for logical deduction and synthesis. Extensive experiments on DocBench and MMLongBench demonstrate that AutoThinkRAG significantly reduces inference costs while achieving new state-of-the-art performance. Further ablation studies verify the effectiveness of each proposed component.
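The abstract describes two mechanisms: a router that picks an inference path by query complexity, and a decoupled path in which a small VLM textualizes visual cues for an LLM to reason over. A minimal Python sketch of that control flow is below; all class and function names (`classify_complexity`, `small_vlm_describe`, `llm_reason`) are illustrative assumptions, not the authors' actual API, and the model calls are stubbed.

```python
def classify_complexity(query: str) -> str:
    """Toy stand-in for the Query Complexity Router.

    A real router would use a learned classifier over the query (and
    possibly retrieved context); here we key off a few surface markers
    purely for illustration.
    """
    hard_markers = ("compare", "trend", "why", "across", "difference")
    return "complex" if any(m in query.lower() for m in hard_markers) else "simple"


def small_vlm_describe(query: str, page_image: bytes) -> str:
    """Placeholder for the small VLM acting as a visual interpreter:
    it would return query-relevant visual cues as text."""
    return f"[textual cues relevant to: {query}]"


def llm_reason(query: str, cues: str) -> str:
    """Placeholder for the LLM performing logical deduction over the
    textualized cues."""
    return f"answer to {query!r} derived from {cues}"


def answer(query: str, page_image: bytes) -> str:
    """Route a query: simple queries take the cheap direct-VLM path,
    complex queries take the decoupled VLM -> text -> LLM path."""
    if classify_complexity(query) == "simple":
        return small_vlm_describe(query, page_image)
    cues = small_vlm_describe(query, page_image)
    return llm_reason(query, cues)
```

The design choice the sketch highlights is that the expensive LLM is only invoked on the complex path, which is how the framework can cut average inference cost without sacrificing accuracy on hard queries.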