🤖 AI Summary
Visual grounding remains challenging in text-dense, multilingual document images, limiting the interpretability and trustworthiness of document visual question answering (VQA) systems. To address this, we propose a fine-grained, cross-level localization paradigm—spanning blocks, lines, words, and points—featuring: (1) a multi-granularity region matching algorithm for structured alignment between OCR outputs and visual regions; (2) a human-verified, multi-level annotated grounding benchmark curated from the CircularsVQA test set; and (3) a multi-block/multi-line collaborative reasoning mechanism that mitigates localization accuracy bottlenecks in existing vision-language models. Our approach integrates multilingual OCR, large language models, and custom alignment algorithms, achieving state-of-the-art grounding performance on this benchmark, with the best precision–recall trade-off at the line level. Code and dataset are publicly released.
📝 Abstract
Visual grounding in text-rich document images is a critical yet underexplored challenge for document intelligence and visual question answering (VQA) systems. We present drishtikon, a multi-granular visual grounding framework designed to enhance interpretability and trust in VQA for complex, multilingual documents. Our approach integrates robust multilingual OCR, large language models, and a novel region matching algorithm to accurately localize answer spans at the block, line, word, and point levels. We curate a new benchmark from the CircularsVQA test set, providing fine-grained, human-verified annotations across multiple granularities. Extensive experiments demonstrate that our method achieves state-of-the-art grounding accuracy, with line-level granularity offering the best trade-off between precision and recall. Ablation studies further highlight the benefits of multi-block and multi-line reasoning. Comparative evaluations with leading vision-language models reveal the limitations of current VLMs in precise localization, underscoring the effectiveness of our structured, alignment-based approach. Our findings pave the way for more robust and interpretable document understanding systems in real-world, text-centric scenarios. Code and dataset have been made available at https://github.com/kasuba-badri-vishal/DhrishtiKon.
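To make the multi-granular grounding idea concrete, the following is a minimal, hypothetical sketch of aligning an answer string to OCR output and aggregating boxes at the word, line, block, and point levels. It is not the paper's actual region matching algorithm; the OCR word schema (`text`, `box`, `line_id`, `block_id`) and the fuzzy-match threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher


def union_box(boxes):
    # Smallest (x1, y1, x2, y2) box covering all input boxes.
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))


def ground_answer(ocr_words, answer, sim_threshold=0.8):
    """Match answer tokens to OCR words, then aggregate boxes at
    word, line, block, and point granularities.

    ocr_words: list of dicts with keys 'text', 'box' (x1, y1, x2, y2),
               'line_id', 'block_id' -- an assumed OCR output schema.
    """
    tokens = answer.lower().split()
    # Keep OCR words that fuzzily match any answer token.
    matched = [
        w for w in ocr_words
        if any(
            SequenceMatcher(None, w["text"].lower(), t).ratio() >= sim_threshold
            for t in tokens
        )
    ]
    if not matched:
        return None

    word_boxes = [w["box"] for w in matched]
    line_ids = {w["line_id"] for w in matched}
    block_ids = {w["block_id"] for w in matched}

    # Line/block boxes: union of all OCR words sharing the matched id.
    line_boxes = [
        union_box([w["box"] for w in ocr_words if w["line_id"] == lid])
        for lid in sorted(line_ids)
    ]
    block_boxes = [
        union_box([w["box"] for w in ocr_words if w["block_id"] == bid])
        for bid in sorted(block_ids)
    ]

    # Point-level grounding: centroid of the matched words' union box.
    x1, y1, x2, y2 = union_box(word_boxes)
    point = ((x1 + x2) / 2, (y1 + y2) / 2)

    return {"words": word_boxes, "lines": line_boxes,
            "blocks": block_boxes, "point": point}
```

In this sketch, coarser levels (line, block) trade precision for recall by expanding the grounded region, which is one way to read the abstract's observation that line-level granularity balances the two best.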