🤖 AI Summary
Visual grounding remains challenging in text-dense, multilingual document images, limiting the interpretability and trustworthiness of document visual question answering (VQA) systems. To address this, we propose a fine-grained, cross-level localization paradigm—spanning blocks, lines, words, and points—featuring: (1) a multi-granularity region matching algorithm for structured alignment between OCR outputs and visual regions; (2) a human-verified, multi-level annotated grounding benchmark curated from the CircularsVQA test set; and (3) a multi-block/multi-line collaborative reasoning mechanism that mitigates localization accuracy bottlenecks in existing vision-language models. Our approach integrates multilingual OCR, large language models, and custom alignment algorithms, achieving state-of-the-art grounding performance on this benchmark, with the best precision–recall trade-off at the line level. Code and dataset are publicly released.
📝 Abstract
Visual grounding in text-rich document images is a critical yet underexplored challenge for document intelligence and visual question answering (VQA) systems. We present drishtikon, a multi-granular visual grounding framework designed to enhance interpretability and trust in VQA for complex, multilingual documents. Our approach integrates robust multilingual OCR, large language models, and a novel region matching algorithm to accurately localize answer spans at the block, line, word, and point levels. We curate a new benchmark from the CircularsVQA test set, providing fine-grained, human-verified annotations across multiple granularities. Extensive experiments demonstrate that our method achieves state-of-the-art grounding accuracy, with line-level granularity offering the best trade-off between precision and recall. Ablation studies further highlight the benefits of multi-block and multi-line reasoning. Comparative evaluations with leading vision-language models reveal the limitations of current VLMs in precise localization, underscoring the effectiveness of our structured, alignment-based approach. Our findings pave the way for more robust and interpretable document understanding systems in real-world, text-centric scenarios. Code and dataset have been made available at https://github.com/kasuba-badri-vishal/DhrishtiKon.
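To make the multi-granular grounding idea concrete, the following is a minimal, hypothetical sketch of aligning an answer string to OCR output and aggregating boxes at the word, line, block, and point levels. It is not the paper's actual region matching algorithm; the OCR word schema (`text`, `box`, `line_id`, `block_id`) and the fuzzy-match threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher


def union_box(boxes):
    # Smallest (x1, y1, x2, y2) box covering all input boxes.
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))


def ground_answer(ocr_words, answer, sim_threshold=0.8):
    """Match answer tokens to OCR words, then aggregate boxes at
    word, line, block, and point granularities.

    ocr_words: list of dicts with keys 'text', 'box' (x1, y1, x2, y2),
               'line_id', 'block_id' -- an assumed OCR output schema.
    """
    tokens = answer.lower().split()
    # Keep OCR words that fuzzily match any answer token.
    matched = [
        w for w in ocr_words
        if any(
            SequenceMatcher(None, w["text"].lower(), t).ratio() >= sim_threshold
            for t in tokens
        )
    ]
    if not matched:
        return None

    word_boxes = [w["box"] for w in matched]
    line_ids = {w["line_id"] for w in matched}
    block_ids = {w["block_id"] for w in matched}

    # Line/block boxes: union of all OCR words sharing the matched id.
    line_boxes = [
        union_box([w["box"] for w in ocr_words if w["line_id"] == lid])
        for lid in sorted(line_ids)
    ]
    block_boxes = [
        union_box([w["box"] for w in ocr_words if w["block_id"] == bid])
        for bid in sorted(block_ids)
    ]

    # Point-level grounding: centroid of the matched words' union box.
    x1, y1, x2, y2 = union_box(word_boxes)
    point = ((x1 + x2) / 2, (y1 + y2) / 2)

    return {"words": word_boxes, "lines": line_boxes,
            "blocks": block_boxes, "point": point}
```

In this sketch, coarser levels (line, block) trade precision for recall by expanding the grounded region, which is one way to read the abstract's observation that line-level granularity balances the two best.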