Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

📅 2026-04-11

🤖 AI Summary

This work addresses the high storage and computational costs that hinder the deployment of multi-vector models in visual document retrieval. It proposes ColChunk, a plug-and-play framework that introduces multimodal late chunking to this task for the first time. ColChunk adaptively groups image patch embeddings via hierarchical clustering fused with a 2D position prior, yielding compact, context-aware multi-vector representations. Unlike conventional pruning or fixed-token strategies, this grouping is content-aware and preserves spatial-semantic coherence. Evaluated across 24 visual document retrieval datasets, ColChunk reduces storage requirements by over 90% and delivers a 9-point average improvement in nDCG@5 across representative single-vector models.
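
To make the storage argument concrete, the sketch below shows why index size grows with the number of vectors kept per page. It assumes ColBERT-style MaxSim scoring, which the summary does not spell out, and the function names, toy vectors, and the counts 1000 and 64 are purely hypothetical, not figures from the paper.

```python
# Illustrative sketch only. MaxSim (ColBERT-style late interaction) is an
# assumption about how the multi-vector models score pages; all numbers
# below are hypothetical.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """For each query vector take its best dot product against any
    document vector, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def storage_reduction(n_original, n_compressed):
    """Fraction of per-page vectors (and hence index storage) saved."""
    return 1.0 - n_compressed / n_original

query = [[1.0, 0.0], [0.0, 1.0]]
page = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

score = maxsim_score(query, page)        # best matches contribute 0.9 and 0.8
reduction = storage_reduction(1000, 64)  # hypothetical: 1000 patches kept as 64 chunks
```

Every vector kept for a page must be stored and compared against each query vector, so shrinking the per-page vector count lowers both storage and scoring cost roughly in proportion.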

📝 Abstract

Multi-vector models dominate Visual Document Retrieval (VDR) due to their fine-grained matching capabilities, but their high storage and computational costs present a major barrier to practical deployment. In this paper, we propose ColChunk, a plug-and-play framework that introduces multimodal late chunking to construct efficient, contextualized multi-vectors. Unlike existing pruning or fixed-token approaches, ColChunk employs hierarchical clustering on patch-level embeddings, fused with a 2D position prior to ensure spatial-semantic coherence. This adaptive grouping yields a content-aware representation that preserves global context while drastically reducing the vector count. Evaluations across 24 VDR datasets demonstrate that ColChunk reduces storage requirements by over 90% while delivering a 9-point average improvement in nDCG@5 across representative single-vector models. ColChunk provides a practical solution for balancing retrieval accuracy and efficiency in visual document systems.
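
The grouping step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the fusion rule (cosine distance plus `alpha` times normalized grid distance), average linkage, mean pooling, and every name below (`chunk_patches`, `pool`, `alpha`, `n_chunks`) are hypothetical choices, since the abstract does not specify them.

```python
import math

# Sketch only: the distance fusion, linkage, and pooling are assumptions.

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def fused_distance(emb, pos, i, j, alpha, max_dist):
    """Semantic distance plus a weighted, normalized 2D grid distance."""
    spatial = math.dist(pos[i], pos[j]) / max_dist
    return cosine_distance(emb[i], emb[j]) + alpha * spatial

def chunk_patches(emb, pos, n_chunks, alpha=0.5):
    """Average-linkage agglomerative clustering down to n_chunks groups."""
    n = len(emb)
    max_dist = max(math.dist(p, q) for p in pos for q in pos) or 1.0
    d = [[fused_distance(emb, pos, i, j, alpha, max_dist) for j in range(n)]
         for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_chunks:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Mean pairwise distance between the two clusters.
                link = sum(d[i][j] for i in clusters[a] for j in clusters[b])
                link /= len(clusters[a]) * len(clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def pool(emb, clusters):
    """Mean-pool member embeddings into one vector per chunk."""
    dim = len(emb[0])
    return [[sum(emb[i][k] for i in c) / len(c) for k in range(dim)]
            for c in clusters]

# Toy page: a 2x2 patch grid with 2-dimensional embeddings.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pos = [(0, 0), (0, 1), (1, 0), (1, 1)]
clusters = chunk_patches(emb, pos, n_chunks=2)
chunks = pool(emb, clusters)
```

On the toy grid, the two top-row patches and the two bottom-row patches are both semantically similar and spatially adjacent, so each row collapses into one pooled vector. Raising `alpha` would favour spatially compact chunks over purely semantic ones.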

Problem

Research questions and friction points this paper is trying to address.

Visual Document Retrieval

multi-vector models

storage cost

computational cost

efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

late chunking

multi-vector retrieval

hierarchical clustering