AutoThinkRAG: Complexity-Aware Control of Retrieval-Augmented Reasoning for Image-Text Interaction

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of reasoning accuracy and efficiency in information-dense document question answering, where long contexts and information overload hinder existing end-to-end vision-language models (VLMs). The authors propose a functionally decoupled multi-model collaborative architecture that leverages a lightweight VLM to extract visual cues and convert them into textual form, which is then processed by a large language model for logical reasoning. A query-complexity-aware routing mechanism dynamically selects the optimal inference path, and the framework is further enhanced with retrieval-augmented generation (RAG). The approach achieves new state-of-the-art results on DocBench and MMLongBench while significantly reducing inference costs, with ablation studies confirming the effectiveness of each component.

📝 Abstract
Information-intensive Document Question Answering (DocQA) is often constrained by long contexts and information overload, which hinders Vision-Language Models (VLMs) from performing precise direct reasoning. Although multimodal GraphRAG has achieved preliminary breakthroughs, existing frameworks still face dual challenges: (1) the necessity of large-scale models for handling queries of diverse complexities and (2) the inherent reasoning bottlenecks of end-to-end VLMs. To address these issues, we propose AutoThinkRAG, a framework that enhances the understanding of complex documents by synergizing the capabilities of multiple models. Specifically, we introduce a Query Complexity Router to allocate reasoning paths based on the analysis of query difficulty. Furthermore, to overcome the reasoning boundaries of VLMs, we propose a functional decoupling architecture: a small-scale VLM serves as a high-fidelity visual interpreter to transform query-relevant visual cues into textual representations, which are subsequently processed by an LLM for logical deduction and synthesis. Extensive experiments on DocBench and MMLongBench demonstrate that AutoThinkRAG significantly reduces inference costs while achieving new state-of-the-art performance. Further ablation studies verify the effectiveness of our proposed method.
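The routing idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the paper does not specify its complexity scorer, thresholds, or path names, so `complexity_score`, `route`, and the heuristic features are hypothetical stand-ins for the Query Complexity Router's actual logic.

```python
# Hypothetical sketch of complexity-aware routing in the spirit of AutoThinkRAG.
# The scoring heuristic, threshold, and path names are illustrative assumptions,
# not the paper's implementation.

def complexity_score(query: str) -> float:
    """Toy proxy for query difficulty: multi-hop cue words and length
    push the score up. Clamped to [0, 1]."""
    cues = ("and", "compare", "why", "how")
    hops = sum(query.lower().count(w) for w in cues)
    return min(1.0, 0.1 * hops + len(query.split()) / 50)

def route(query: str, threshold: float = 0.4) -> str:
    """Select an inference path based on estimated query complexity."""
    if complexity_score(query) < threshold:
        # Simple query: answer directly with the lightweight VLM.
        return "direct_vlm"
    # Complex query: VLM converts visual cues to text, RAG retrieves
    # context, and an LLM performs the logical reasoning.
    return "vlm_to_llm_rag"

print(route("What is the title?"))
print(route("Compare revenue trends and explain why margins "
            "fell across the two tables."))
```

A learned classifier would replace the heuristic in practice; the point is only that cheap queries skip the expensive VLM-to-LLM pipeline, which is where the reported inference-cost savings would come from.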
Problem

Research questions and friction points this paper is trying to address.

Document Question Answering
Vision-Language Models
Information Overload
Reasoning Bottleneck
Query Complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
Vision-Language Models
Query Complexity Routing
Functional Decoupling
Multimodal Reasoning
Jiashu Yang — Dalian University of Technology
Chi Zhang — Dalian University of Technology
Abudukelimu Wuerkaixi — Tsinghua University
Xuxin Cheng — University of California, San Diego
Cao Liu — Meituan LongCat Interaction Team
Ke Zeng — Meituan LongCat Interaction Team
Xu Jia — Associate Professor at Dalian University of Technology (Computer Vision, Machine Learning, Bio-Inspired Vision)
Xunliang Cai — Meituan LongCat Interaction Team