ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization

📅 2025-11-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Document Visual Question Answering (DocVQA) faces a fundamental trade-off between answer-text accuracy and spatial-localization reliability. To address this, we propose a modular planning-agent framework powered by large language models (LLMs), which decomposes the task into four interpretable, independently optimizable submodules: text recognition with TrOCR, semantic retrieval, answer generation via a fine-tuned Gemma 3-27B, and text-to-region alignment. An LLM orchestrates this toolchain to deliver end-to-end, high-precision answer extraction with accurate bounding-box localization. Our method achieves state-of-the-art performance on four major benchmarks (DocVQA, FUNSD, CORD, and SROIE), including 88.7 ANLS and 50.1 mAP on DocVQA, surpassing DLaVA by +2.8 ANLS and +3.9 mAP. The core contributions are the first LLM-based agent design to unify performance and interpretability, and empirical evidence that modular decoupling is critical to localization precision.
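The four-stage toolchain can be pictured as a planner that sequences specialized tools and records a reasoning trace. The sketch below is illustrative only: the tool bodies are hypothetical stand-ins for TrOCR, the semantic retriever, the Gemma 3-27B generator, and the alignment module, not ARIAL's actual implementations.

```python
# Hypothetical sketch of a four-stage DocVQA toolchain; all tool
# bodies are placeholder stand-ins, only the orchestration pattern
# mirrors the paper's description.
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    box: tuple  # (x0, y0, x1, y1) in image coordinates


def ocr_tool(image) -> list[Token]:
    # Stand-in for TrOCR: return recognized tokens with bounding boxes.
    return [Token("Invoice", (10, 10, 80, 30)), Token("42.00", (90, 10, 140, 30))]


def retrieve_tool(tokens: list[Token], question: str) -> list[Token]:
    # Stand-in for semantic retrieval: keep tokens relevant to the question.
    return [t for t in tokens if any(ch.isdigit() for ch in t.text)] or tokens


def generate_tool(context: list[Token], question: str) -> str:
    # Stand-in for the fine-tuned answer generator.
    return context[0].text


def align_tool(answer: str, tokens: list[Token]) -> tuple:
    # Text-to-region alignment: box of the token best matching the answer.
    best = max(tokens, key=lambda t: len(set(t.text) & set(answer)))
    return best.box


def run_pipeline(image, question: str) -> dict:
    """Planner: sequence the four tools and keep a tool-level trace."""
    tokens = ocr_tool(image)
    context = retrieve_tool(tokens, question)
    answer = generate_tool(context, question)
    box = align_tool(answer, tokens)
    return {"answer": answer, "box": box,
            "trace": ["ocr", "retrieve", "generate", "align"]}


result = run_pipeline(image=None, question="What is the total?")
```

The trace returned alongside the answer is what makes each step auditable: a failure can be attributed to a specific tool and that tool optimized independently.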

📝 Abstract
Document Visual Question Answering (VQA) requires models to not only extract accurate textual answers but also precisely localize them within document images, a capability critical for interpretability in high-stakes applications. However, existing systems achieve strong textual accuracy while producing unreliable spatial grounding, or sacrifice performance for interpretability. We present ARIAL (Agentic Reasoning for Interpretable Answer Localization), a modular framework that orchestrates specialized tools through an LLM-based planning agent to achieve both precise answer extraction and reliable spatial grounding. ARIAL decomposes Document VQA into structured subtasks: OCR-based text extraction with TrOCR, retrieval-augmented context selection using semantic search, answer generation via a fine-tuned Gemma 3-27B model, and explicit bounding-box localization through text-to-region alignment. This modular architecture produces transparent reasoning traces, enabling tool-level auditability and independent component optimization. We evaluate ARIAL on four benchmarks (DocVQA, FUNSD, CORD, and SROIE) using both textual accuracy (ANLS) and spatial precision (mAP at IoU 0.50 to 0.95). ARIAL achieves state-of-the-art results across all datasets: 88.7 ANLS and 50.1 mAP on DocVQA, 90.0 ANLS and 50.3 mAP on FUNSD, 85.5 ANLS and 60.2 mAP on CORD, and 93.1 ANLS on SROIE, surpassing the previous best method (DLaVA) by +2.8 ANLS and +3.9 mAP on DocVQA. Our work demonstrates how agentic orchestration of specialized tools can simultaneously improve performance and interpretability, providing a pathway toward trustworthy, explainable document AI systems.
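The spatial metric named in the abstract, mAP at IoU 0.50 to 0.95, can be sketched for a single predicted box: compute intersection-over-union against the ground-truth box and average the hit rate over the ten COCO-style thresholds. This is a minimal sketch of the per-answer score only; the paper's full mAP additionally averages over questions, and the (x0, y0, x1, y1) box format is an assumption.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def avg_precision_over_thresholds(pred, gold):
    """Fraction of thresholds 0.50, 0.55, ..., 0.95 at which the box is a hit."""
    thresholds = [0.50 + 0.05 * k for k in range(10)]
    score = iou(pred, gold)
    return sum(score >= t for t in thresholds) / len(thresholds)


# A perfectly overlapping box passes every threshold.
print(avg_precision_over_thresholds((0, 0, 10, 10), (0, 0, 10, 10)))  # 1.0
```

Averaging over a range of IoU thresholds, rather than a single cutoff, is what makes the metric sensitive to how tightly the predicted box fits the answer region.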
Problem

Research questions and friction points this paper is trying to address.

Achieving precise answer localization in document visual question answering
Resolving trade-off between textual accuracy and spatial grounding reliability
Developing interpretable Document VQA systems with transparent reasoning traces
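The textual-accuracy side of the trade-off above is measured by ANLS (Average Normalized Levenshtein Similarity). A minimal per-answer sketch: one minus the normalized edit distance, zeroed below the standard 0.5 threshold used by DocVQA-style evaluation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]


def anls(pred: str, gold: str, threshold: float = 0.5) -> float:
    """Per-answer ANLS: similarities below the threshold count as 0."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    denom = max(len(pred), len(gold)) or 1
    sim = 1.0 - levenshtein(pred, gold) / denom
    return sim if sim >= threshold else 0.0
```

The threshold keeps near-miss OCR errors from earning partial credit while still tolerating minor transcription noise, which is why ANLS is the standard textual metric for this task.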
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based agent orchestrates specialized modular tools
Decomposes Document VQA into OCR, retrieval, generation, and localization
Achieves transparent reasoning with tool-level auditability
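The final localization step, aligning the generated answer text back to a document region, can be sketched as a search over contiguous OCR-token spans for the one most similar to the answer string. The `SequenceMatcher` similarity below is a stand-in; whatever matcher ARIAL actually uses is not specified here.

```python
# Illustrative text-to-region alignment: pick the contiguous OCR-token
# span whose concatenated text best matches the generated answer, and
# return the union of that span's boxes. SequenceMatcher.ratio() is an
# assumed similarity function, not the paper's.
from difflib import SequenceMatcher


def align(answer, tokens):
    """tokens: list of (text, (x0, y0, x1, y1)); returns (best box, score)."""
    best_score, best_box = -1.0, None
    for i in range(len(tokens)):
        for j in range(i, len(tokens)):
            span_text = " ".join(t for t, _ in tokens[i:j + 1])
            score = SequenceMatcher(None, answer.lower(), span_text.lower()).ratio()
            if score > best_score:
                boxes = [b for _, b in tokens[i:j + 1]]
                best_box = (min(b[0] for b in boxes), min(b[1] for b in boxes),
                            max(b[2] for b in boxes), max(b[3] for b in boxes))
                best_score = score
    return best_box, best_score


tokens = [("Grand", (5, 50, 45, 60)), ("Total", (50, 50, 90, 60)),
          ("$12.50", (95, 50, 140, 60))]
box, score = align("grand total", tokens)
```

Because the box is derived directly from OCR evidence rather than regressed by the answer model, the localization stays verifiable even when the generator paraphrases.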