Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval

📅 2026-05-08
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a limitation of existing visual document retrieval models: late-interaction architectures score documents from local image patch embeddings alone and neglect global layout structure, which leads to inaccurate relevance estimates for documents with heterogeneous layouts that combine figures, tables, and text. The authors propose a multimodal encoder that augments local patch representations with a global layout embedding, trained under supervision from textual descriptions of document layout, and that requires no change to the inference pipeline. Across four ViDoRe-v2 datasets, the model improves over the strongest architecturally comparable ColPali/ColQwen baseline by +2.4 nDCG@5 and +2.3 MAP@5, with statistically significant per-dataset gains over ColQwen.
📝 Abstract
Visual Document Retrieval (VDR) models mostly rely on late-interaction architectures, in which a document is represented by a set of local patch embeddings that are matched against query tokens. While efficient, this architecture prioritizes local similarity over the global layout structure of documents when estimating the relevance of a document to a query. In practice, this leads to errors, because relevance depends on the layout structure of documents whose heterogeneous layouts combine figures, tables, and text. We make document layout learnable without changing inference. We propose a multimodal encoder that augments local patch representations with a global layout embedding, trained via textual descriptions encoding document layout information. Across four ViDoRe-v2 datasets, our model improves over the strongest architecturally comparable ColPali/ColQwen baseline by +2.4 nDCG@5 and +2.3 MAP@5, with statistically significant per-dataset gains over ColQwen.
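To make the late-interaction setup concrete, the following is a minimal sketch rather than the authors' implementation. It shows the standard MaxSim scoring used by ColPali-style models, plus one hypothetical way a global layout embedding could be folded into patch embeddings at indexing time so that the scoring function stays unchanged. The abstract does not say how the fusion is done: the additive fusion, the `alpha` weight, and all function names here are assumptions for illustration only.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_patches):
    """Late-interaction (MaxSim) score: for each query token, take the
    best-matching patch similarity, then sum over query tokens."""
    return sum(max(dot(q, p) for p in doc_patches) for q in query_tokens)


def add_global_layout(doc_patches, layout_vec, alpha=0.5):
    """ASSUMPTION (not from the paper): fuse a document-level layout vector
    into every patch by scaled addition, then renormalize. Because this runs
    when the document is indexed, query-time scoring is untouched."""
    return [
        normalize([p_i + alpha * g_i for p_i, g_i in zip(patch, layout_vec)])
        for patch in doc_patches
    ]


# Toy 3-dimensional example with two query tokens and two patches.
query = [normalize([1.0, 0.0, 0.0]), normalize([0.0, 1.0, 0.0])]
patches = [normalize([1.0, 0.0, 0.0]), normalize([0.0, 0.0, 1.0])]
layout = normalize([0.0, 1.0, 0.0])  # hypothetical global layout embedding

base_score = maxsim_score(query, patches)
augmented_patches = add_global_layout(patches, layout, alpha=1.0)
augmented_score = maxsim_score(query, augmented_patches)
```

In this toy case the second query token matches no individual patch, so the baseline score comes only from the first token. After fusion, every patch carries some of the document-level signal, and the same `maxsim_score` call rewards the document for it. The number of patch vectors per document is unchanged, which is one way to stay consistent with the claim that inference is not modified.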
Problem

Research questions and friction points this paper is trying to address.

Visual Document Retrieval
Late Interaction
Global Layout
Layout Structure
Document Retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

global layout embedding
textual supervision
late-interaction retrieval
visual document retrieval
multimodal encoder