HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering

📅 2026-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges of document localization failure and fragmented multimodal evidence in open-domain document question answering by proposing a hierarchical tree-based multimodal retrieval framework. The method uniquely leverages document hierarchy as a core retrieval signal, constructing a heterogeneous graph through structural parsing and employing a coarse-to-fine retrieval strategy: it first performs global routing to narrow the search space and then fuses fine-grained textual and visual evidence to assemble an efficient evidence subgraph. By integrating hierarchical indexing with a structure-aware semantic packing mechanism, the approach overcomes limitations inherent in conventional flat chunking or page-level image-based methods. Evaluated on standard ODQA benchmarks, the framework achieves up to a 12.9% improvement in retrieval recall and a 6.8% gain in end-to-end question answering performance.
📝 Abstract
Retrieval-augmented generation (RAG) for document-based Open-domain Question Answering (ODQA) on large-scale industrial corpora faces two critical bottlenecks: routing failure in locating the correct document and evidence fragmentation in integrating scattered information. Existing approaches relying on flat text chunks or page-level images inherently struggle to (i) precisely pinpoint the target document among thousands of candidates and (ii) organically connect multimodal evidence, such as tables and figures, within a limited token budget. To address these challenges, we propose HiKEY, a hierarchical tree-based multimodal retrieval framework that elevates document hierarchy to a first-class retrieval signal. Instead of simple chunking, HiKEY reconstructs a logical heterogeneous graph via Document Hierarchical Parsing (DHP), explicitly encoding parent-child relationships. Adopting a hierarchical coarse-to-fine strategy, the framework (1) performs global routing to rapidly prune the search space using hierarchical indexing, and (2) conducts fine-grained retrieval to rank sections by employing a multimodal fusion strategy that captures the most discriminative evidence. Finally, HiKEY assembles a token-efficient evidence subgraph via a hybrid structural-semantic packing strategy. Experiments on ODQA benchmarks demonstrate that HiKEY significantly outperforms page- and chunk-based baselines, improving retrieval recall by up to 12.9% and end-to-end QA performance by up to 6.8%.
Problem

Research questions and friction points this paper is trying to address.

retrieval-augmented generation
open-domain question answering
multimodal retrieval
evidence fragmentation
routing failure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Retrieval
Multimodal Fusion
Document Hierarchical Parsing
Retrieval-Augmented Generation
Open-Domain QA