Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations

📅 2026-03-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of visual document retrieval: existing multi-vector architectures struggle to preserve layout information while keeping storage costs low. To bridge this gap, the authors propose ColParse, an approach that integrates document parsing into multi-vector retrieval for the first time. ColParse uses a document parsing model to generate a small set of layout-aware sub-image embeddings, which are then fused with a global page-level vector to form a compact yet structurally informed representation. Evaluated across multiple benchmarks and foundation models, ColParse reduces storage by over 95% while significantly improving retrieval accuracy, reconciling high performance with low storage overhead.
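To make the "over 95%" storage figure concrete, the following is a minimal arithmetic sketch, not taken from the paper. It assumes storage scales linearly with the number of vectors per page (same dimension and precision for every vector). The baseline of 1024 patch vectors per page is a hypothetical number chosen for illustration, and both function names are made up for this example.

```python
def compression_ratio(baseline_vectors: int, compact_vectors: int) -> float:
    """Fraction of storage saved, assuming cost is proportional to vector count."""
    if baseline_vectors <= 0:
        raise ValueError("baseline_vectors must be positive")
    return 1.0 - compact_vectors / baseline_vectors


def max_vectors_for_target(baseline_vectors: int, target_percent: int) -> int:
    """Largest per-page vector count that still saves at least target_percent.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    return baseline_vectors * (100 - target_percent) // 100


# Hypothetical baseline: a patch-grid retriever storing 1024 vectors per page.
budget = max_vectors_for_target(1024, 95)
print(budget, round(compression_ratio(1024, budget), 4))
```

Under these assumptions, a page has to be represented by a few dozen vectors at most to reach the reported compression level, which is consistent with the summary's description of "a small set" of sub-image embeddings.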

📝 Abstract

Harnessing the full potential of visually-rich documents requires retrieval systems that understand not just text, but intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a crucial storage bottleneck that current optimization strategies, such as embedding merging, pruning, or using abstract tokens, fail to resolve without compromising performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact and structurally-aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
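The representation described in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not specify the fusion operator, so a convex combination controlled by a hypothetical `alpha` parameter stands in for it, and scoring uses the MaxSim late-interaction rule common to ColBERT-style retrievers, which the abstract does not confirm. The region and page embeddings are assumed to come from some upstream parser and encoder that are not shown here.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    return [x / norm for x in v]


def fuse_with_global(
    region_vecs: Sequence[Sequence[float]],
    global_vec: Sequence[float],
    alpha: float = 0.5,
) -> List[Vector]:
    """Mix each region embedding with the page-level embedding.

    ASSUMPTION: a convex combination followed by re-normalization.
    The source only says the vectors are "fused".
    """
    g = l2_normalize(global_vec)
    fused = []
    for r in region_vecs:
        r_unit = l2_normalize(r)
        mixed = [(1.0 - alpha) * ri + alpha * gi for ri, gi in zip(r_unit, g)]
        fused.append(l2_normalize(mixed))
    return fused


def maxsim_score(
    query_vecs: Sequence[Sequence[float]],
    doc_vecs: Sequence[Sequence[float]],
) -> float:
    """Late-interaction score: sum over query vectors of the best dot product."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(qi * di for qi, di in zip(q, d)) for d in doc_vecs)
    return total


# Toy 2-D example: two parsed regions plus one page-level vector.
page = fuse_with_global([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0], alpha=0.5)
print(len(page), maxsim_score([[1.0, 0.0]], page))
```

The page is stored as one vector per parsed region rather than one per image patch, which is where the storage saving comes from. Each stored vector still carries page-level context through the fusion step.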

Problem

Research questions and friction points this paper is trying to address.

Visual Document Retrieval

multi-vector retrieval

storage bottleneck

layout understanding

embedding efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

layout-aware retrieval

multi-vector representation

document parsing