🤖 AI Summary
This work addresses the high storage and computational costs that hinder the deployment of multi-vector models in visual document retrieval. To this end, we propose ColChunk, a framework that introduces multimodal late chunking to this task for the first time. ColChunk adaptively groups image patch embeddings via hierarchical clustering informed by a 2D position prior, yielding compact, context-aware multi-vector representations. Unlike conventional pruning or fixed-token strategies, this content-aware compression preserves spatial-semantic coherence. Evaluated across 24 visual document retrieval datasets, ColChunk reduces storage requirements by over 90% and achieves an average improvement of 9 nDCG@5 points over representative single-vector baselines.
📝 Abstract
Multi-vector models dominate Visual Document Retrieval (VDR) due to their fine-grained matching capabilities, but their high storage and computational costs present a major barrier to practical deployment. In this paper, we propose ColChunk, a plug-and-play framework that introduces multimodal late chunking to construct efficient, contextualized multi-vectors. Unlike existing pruning or fixed-token approaches, ColChunk applies hierarchical clustering to patch-level embeddings, fused with a 2D position prior to ensure spatial-semantic coherence. This adaptive grouping yields a content-aware representation that preserves global context while drastically reducing the vector count. Evaluations across 24 VDR datasets demonstrate that ColChunk reduces storage requirements by over 90% while simultaneously delivering a 9-point average improvement in nDCG@5 across representative single-vector models. ColChunk thus provides a practical solution for balancing retrieval accuracy and efficiency in visual document systems.
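To make the grouping step concrete, the following is a minimal sketch of position-aware hierarchical clustering over patch embeddings. It is not the authors' implementation. The abstract does not specify how the 2D prior is fused, which linkage is used, or how chunk vectors are formed, so the additive distance with weight `lam`, the average linkage, the mean pooling, and the fixed `num_chunks` target are all assumptions made for illustration. The function names `fused_distance` and `chunk_patches` are hypothetical.

```python
import math
from itertools import combinations


def _normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def fused_distance(e_i, e_j, p_i, p_j, diag, lam):
    """Assumed fusion: cosine distance of unit embeddings plus a weighted 2D position term."""
    cos_dist = 1.0 - sum(a * b for a, b in zip(e_i, e_j))
    pos_dist = math.dist(p_i, p_j) / diag  # grid distance scaled to [0, 1]
    return cos_dist + lam * pos_dist


def chunk_patches(embeddings, grid_w, num_chunks, lam=0.5):
    """Merge row-major patch embeddings into `num_chunks` pooled vectors."""
    n = len(embeddings)
    unit = [_normalize(e) for e in embeddings]
    pos = [(i // grid_w, i % grid_w) for i in range(n)]  # (row, col) of each patch
    grid_h = math.ceil(n / grid_w)
    diag = math.hypot(grid_h - 1, grid_w - 1) or 1.0

    # Pairwise fused distances between individual patches.
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = fused_distance(
            unit[i], unit[j], pos[i], pos[j], diag, lam
        )

    def linkage(a, b):
        # Average linkage: mean pairwise distance between cluster members.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    # Agglomerative clustering: repeatedly merge the two closest clusters.
    clusters = [[i] for i in range(n)]
    while len(clusters) > max(1, num_chunks):
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b

    # Assumed pooling: mean of member embeddings, re-normalized.
    pooled = []
    for members in clusters:
        dim = len(unit[0])
        mean = [sum(unit[i][d] for i in members) / len(members) for d in range(dim)]
        pooled.append(_normalize(mean))
    return clusters, pooled


# Toy page: a 2 x 4 patch grid whose left and right halves carry different content.
left, right = [1.0, 0.0], [0.0, 1.0]
toy_embeddings = [left, left, right, right, left, left, right, right]
toy_clusters, toy_vectors = chunk_patches(toy_embeddings, grid_w=4, num_chunks=2)
print(sorted(sorted(c) for c in toy_clusters))
```

In this toy case the eight patch vectors collapse into two chunk vectors, one per half of the page. A fixed chunk count is a simplification: an adaptive variant could instead stop merging once the smallest linkage distance exceeds a threshold, so that the number of vectors follows the page content.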