Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

📅 2026-04-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the difficulty of deploying multi-vector models for visual document retrieval, where their high storage and computational costs hinder practical use. To this end, the authors propose ColChunk, a framework that is the first to bring multimodal late (post-hoc) chunking to this task. ColChunk adaptively groups image patch embeddings via hierarchical clustering informed by a 2D positional prior, yielding compact, context-aware multi-vector representations. Unlike conventional pruning or fixed-token strategies, this enables content-aware compression while preserving spatial-semantic coherence. Across 24 visual document retrieval datasets, ColChunk reduces storage requirements by over 90% and achieves an average improvement of 9 nDCG@5 points over representative single-vector baselines.
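A minimal Python sketch of the grouping step follows. It assumes several details the summary does not specify: patches lie row-major on a grid, the 2D positional prior is applied by appending normalized (row, column) coordinates scaled by a hypothetical `pos_weight`, clusters are merged by centroid-linkage agglomerative clustering until a hypothetical `num_chunks` remains, and each chunk vector is the L2-normalized mean of its patch embeddings. It illustrates the general idea only and is not ColChunk's actual implementation.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # avoid division by zero
    return [x / norm for x in v]


def mean(vectors: Sequence[Sequence[float]]) -> Vector:
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def chunk_patches(
    patches: Sequence[Sequence[float]],
    grid_w: int,
    grid_h: int,
    num_chunks: int,  # hypothetical stopping criterion: target number of chunks
    pos_weight: float = 0.5,  # hypothetical strength of the 2D position prior
) -> Tuple[List[Vector], List[List[int]]]:
    """Group row-major grid patch embeddings into spatially coherent chunks."""
    # Clustering feature = normalized embedding + weighted, normalized (row, col).
    features = []
    for i, emb in enumerate(patches):
        row, col = divmod(i, grid_w)
        pos = [pos_weight * row / max(grid_h - 1, 1),
               pos_weight * col / max(grid_w - 1, 1)]
        features.append(l2_normalize(emb) + pos)

    # Agglomerative (centroid-linkage) clustering: start from singletons and
    # repeatedly merge the two clusters whose centroids are closest.
    clusters = [[i] for i in range(len(patches))]
    centroids = [list(f) for f in features]
    while len(clusters) > num_chunks:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sq_dist(centroids[a], centroids[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        centroids[a] = mean([features[i] for i in clusters[a]])
        del clusters[b], centroids[b]  # b > a, so index a stays valid

    # Pool the original patch embeddings of each cluster into one chunk vector.
    chunk_vectors = [l2_normalize(mean([patches[i] for i in c])) for c in clusters]
    return chunk_vectors, clusters


# Toy 4x4 page: the top two rows share one topic, the bottom two rows another.
toy_patches = [[1.0, 0.0]] * 8 + [[0.0, 1.0]] * 8
vectors, groups = chunk_patches(toy_patches, grid_w=4, grid_h=4, num_chunks=2)
```

On the toy page the 16 patch vectors collapse into two chunks, one per topic region. Setting `pos_weight` to 0 makes the grouping purely semantic, while larger values favour spatially compact chunks. The naive merge loop is cubic in the number of patches, so a real system would rely on an optimized clustering routine.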

📝 Abstract
Multi-vector models dominate Visual Document Retrieval (VDR) due to their fine-grained matching capabilities, but their high storage and computational costs present a major barrier to practical deployment. In this paper, we propose ColChunk, a plug-and-play framework that introduces multimodal late chunking to construct efficient, contextualized multi-vectors. Unlike existing pruning or fixed-token approaches, ColChunk employs hierarchical clustering on patch-level embeddings, fused with a 2D position prior to ensure spatial-semantic coherence. This adaptive grouping yields a content-aware representation that preserves global context while drastically reducing the vector count. Evaluations across 24 VDR datasets demonstrate that ColChunk reduces storage requirements by over 90% while delivering a 9-point average improvement in nDCG@5 across representative single-vector models. ColChunk thus offers a practical way to balance retrieval accuracy and efficiency in visual document retrieval systems.
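To make the efficiency trade-off concrete, the sketch below scores a query against a page with one vector per patch and against pooled chunk vectors, assuming the ColBERT-style MaxSim late-interaction score commonly used by multi-vector retrievers (the abstract does not name the scoring function). It also computes the per-page storage saving for a chosen vector count. The toy vectors, the fp16 storage assumption, and the 1024-to-64 vector counts are hypothetical values chosen for illustration, not figures from the paper.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: Sequence[Sequence[float]],
           doc_vecs: Sequence[Sequence[float]]) -> float:
    # Late interaction (assumed ColBERT-style MaxSim): match each query vector
    # to its most similar document vector and sum the per-query maxima.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def index_bytes(num_vectors: int, dim: int, bytes_per_value: int = 2) -> int:
    # Storage for one page's vectors, assuming fp16 values (2 bytes each).
    return num_vectors * dim * bytes_per_value


def storage_reduction(full_vectors: int, chunk_vectors: int, dim: int) -> float:
    # Fraction of storage saved by keeping chunk vectors instead of patch vectors.
    return 1.0 - index_bytes(chunk_vectors, dim) / index_bytes(full_vectors, dim)


# Toy page: 8 patches about one topic and 8 about another (2-dim embeddings).
patch_vecs = [[1.0, 0.0]] * 8 + [[0.0, 1.0]] * 8
chunk_vecs = [[1.0, 0.0], [0.0, 1.0]]  # one pooled vector per topic region
query = [[1.0, 0.0], [0.6, 0.8]]

full_score = maxsim(query, patch_vecs)
chunk_score = maxsim(query, chunk_vecs)

# Hypothetical counts: 1024 patch vectors pooled into 64 chunks, 128 dims each.
reduction = storage_reduction(1024, 64, dim=128)  # → 0.9375
```

In this toy case the pooled vectors coincide with the patch vectors, so both scores are identical while the page keeps 2 vectors instead of 16. Because MaxSim work grows with the number of document vectors, cutting the vector count also cuts per-query scoring cost by the same factor. On real pages pooling can change scores, which is why the source evaluates retrieval quality with nDCG@5 across 24 datasets.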
Problem
Visual Document Retrieval
multi-vector models
storage cost
computational cost
efficiency
Innovation
late chunking
multi-vector retrieval
hierarchical clustering
spatial-semantic coherence
visual document retrieval