🤖 AI Summary
This work addresses a limitation of existing visual document retrieval models: they rely on local image patch embeddings and neglect global layout structure, which leads to inaccurate relevance estimation for documents with heterogeneous text-image layouts. The authors propose a global layout modeling approach that requires no change to the inference pipeline: document-level layout semantics are learned from textual descriptions of the layout and integrated into the local patch representations. The method adds a learnable global layout embedding to a late-interaction architecture, using only textual supervision to capture structural information. Evaluated on four ViDoRe-v2 datasets, the approach improves over the strongest architecturally comparable ColPali/ColQwen baseline by 2.4 points in nDCG@5 and 2.3 points in MAP@5, with statistically significant per-dataset gains over ColQwen.
📝 Abstract
Visual Document Retrieval (VDR) models mostly rely on late-interaction architectures, in which a document is represented by a set of local patch embeddings that are matched against query tokens. While efficient, this architecture prioritizes local similarity over the global layout structure of documents when estimating the relevance of a document to a query. In practice, this leads to errors, because relevance originates from the layout structure in documents whose heterogeneous layouts combine figures, tables, and text. We make document layout learnable without changing inference. We propose a multimodal encoder that augments local patch representations with a global layout embedding, trained via textual descriptions that encode document layout information. Across four ViDoRe-v2 datasets, our model improves over the strongest architecturally comparable ColPali/ColQwen baseline by +2.4 nDCG@5 and +2.3 MAP@5, with statistically significant per-dataset gains over ColQwen.
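To make the late-interaction setup concrete, the following is a minimal NumPy sketch of MaxSim scoring over patch embeddings, together with one possible way a global layout vector could be folded into the patches. The abstract does not say how the layout embedding is combined with the patch embeddings, so the additive fusion, the `alpha` weight, and the function names `maxsim_score` and `augment_with_layout` are illustrative assumptions rather than the authors' method. The random arrays stand in for real encoder outputs.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def maxsim_score(query_tokens, doc_patches):
    """Late-interaction score: for each query token, take the best-matching
    document patch by cosine similarity, then sum over query tokens."""
    q = l2_normalize(query_tokens)   # (num_query_tokens, dim)
    d = l2_normalize(doc_patches)    # (num_patches, dim)
    sim = q @ d.T                    # (num_query_tokens, num_patches)
    return float(sim.max(axis=1).sum())


def augment_with_layout(doc_patches, layout_embedding, alpha=0.5):
    """Hypothetical fusion (an assumption, not from the paper): add a scaled
    document-level layout vector to every local patch embedding. The output
    has the same shape as the input, so scoring is unchanged."""
    return doc_patches + alpha * layout_embedding[None, :]


# Placeholder embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
query = rng.normal(size=(4, 8))      # 4 query tokens, dim 8
patches = rng.normal(size=(16, 8))   # 16 image patches, dim 8
layout = rng.normal(size=8)          # one global layout vector per document

augmented = augment_with_layout(patches, layout)
baseline_score = maxsim_score(query, patches)
augmented_score = maxsim_score(query, augmented)
```

Because the augmented patches keep the shape of the originals, the same `maxsim_score` call serves both variants, which is one way to read the claim that layout becomes learnable without changing inference.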