AVIR: Adaptive Visual In-Document Retrieval for Efficient Multi-Page Document Question Answering

📅 2025-12-06
🏛️ ACM Multimedia Asia
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost and inefficient attention mechanisms of large vision-language models (LVLMs) when processing lengthy multi-page documents for visual question answering. To mitigate these issues, the authors propose an adaptive visual document retrieval framework that employs a lightweight model to score page relevance. For the first time, they introduce a score-distribution-based adaptive clustering and thresholding mechanism to dynamically select key pages. Only these selected pages are fed into a frozen LVLM to generate answers, eliminating the need for fine-tuning. Evaluated on MP-DocVQA, the method achieves 84.58% ANLS with a 70% page reduction rate, significantly improving both inference efficiency and accuracy. The approach further demonstrates strong generalization across SlideVQA and DUDE benchmarks.

📝 Abstract
Multi-page Document Visual Question Answering (MP-DocVQA) remains challenging because long documents not only strain computational resources but also reduce the effectiveness of the attention mechanism in large vision-language models (LVLMs). We tackle these issues with an Adaptive Visual In-document Retrieval (AVIR) framework. A lightweight retrieval model first scores each page for question relevance. Pages are then clustered according to the score distribution to adaptively select relevant content. The clustered pages are screened again by Top-K to keep the context compact. However, for short documents, clustering reliability decreases, so we use a relevance probability threshold to select pages. The selected pages alone are fed to a frozen LVLM for answer generation, eliminating the need for model fine-tuning. The proposed AVIR framework reduces the average page count required for question answering by 70%, while achieving an ANLS of 84.58% on the MP-DocVQA dataset—surpassing previous methods with significantly lower computational cost. The effectiveness of the proposed AVIR is also verified on the SlideVQA and DUDE benchmarks. Our code will be made publicly available upon acceptance.
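The selection pipeline in the abstract (score pages, cluster by score distribution for long documents, re-screen with Top-K, fall back to a probability threshold for short documents) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the retrieval scores are assumed to come from an external lightweight model, the 1-D two-cluster split stands in for whatever clustering the paper uses, and all parameter values (`short_doc_len`, `prob_threshold`, `top_k`) are hypothetical.

```python
import math

def select_pages(scores, short_doc_len=4, prob_threshold=0.3, top_k=5, n_iters=20):
    """Adaptively select relevant pages from per-page relevance scores.

    Sketch of an AVIR-style selection:
    - short documents: keep pages whose softmax relevance probability
      exceeds a threshold (clustering is unreliable with few points);
    - long documents: split scores into two groups with a simple 1-D
      2-means, keep the high-scoring group, then cap it with Top-K.
    """
    n = len(scores)
    # Softmax turns raw scores into relevance probabilities.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    if n <= short_doc_len:
        selected = [i for i, p in enumerate(probs) if p >= prob_threshold]
        # Always return at least the single best-scoring page.
        return selected or [max(range(n), key=lambda i: scores[i])]

    # Simple 1-D 2-means on the raw scores (centroids init at min/max).
    lo, hi = min(scores), max(scores)
    for _ in range(n_iters):
        high = [s for s in scores if abs(s - hi) <= abs(s - lo)]
        low = [s for s in scores if abs(s - hi) > abs(s - lo)]
        hi = sum(high) / len(high)
        lo = sum(low) / len(low) if low else lo

    keep = [i for i, s in enumerate(scores) if abs(s - hi) <= abs(s - lo)]
    # Screen the high cluster again with Top-K to keep the context compact.
    keep.sort(key=lambda i: scores[i], reverse=True)
    return sorted(keep[:top_k])
```

Only the pages returned here would be passed to the frozen LVLM, so the model never sees the pruned pages and no fine-tuning is required.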
Problem

Research questions and friction points this paper is trying to address.

Multi-page Document Visual Question Answering
computational efficiency
attention mechanism
large vision-language models
document retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Retrieval
Multi-Page Document VQA
Vision-Language Model
Efficient Inference
In-Document Clustering