🤖 AI Summary
Existing multimodal document retrieval methods naively adapt text-only techniques, neglecting structural and visual features in document encoding, training objectives, and similarity modeling. To address this, we propose ColMate—a novel framework that unifies OCR-aware pretraining, masked cross-modal contrastive learning, and a lightweight late-interaction mechanism. Specifically, OCR-aware pretraining enhances layout and glyph representation; the masked cross-modal contrastive loss improves fine-grained semantic alignment between the text and vision modalities; and the late-interaction scorer enables token-level matching without the overhead of early fusion. Evaluated on the ViDoRe V2 benchmark, ColMate achieves a 3.61% absolute improvement over the prior state of the art. Moreover, it demonstrates superior cross-domain generalization on out-of-distribution datasets. ColMate establishes a new paradigm for multimodal document retrieval that is both structurally aware and cross-modally robust.
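To make the late-interaction scorer concrete: in ColBERT-style late interaction (which the summary's "token-level matching without early fusion" describes), query and document tokens are embedded independently, and the score sums, over query tokens, the maximum similarity against any document token (MaxSim). The sketch below is a generic illustration under that standard formulation, not ColMate's exact implementation; the function name and toy dimensions are illustrative.

```python
import numpy as np

def late_interaction_score(query_emb, doc_emb):
    """ColBERT-style late-interaction (MaxSim) scoring sketch.

    query_emb: (Nq, d) L2-normalized query token embeddings.
    doc_emb:   (Nd, d) L2-normalized document token/patch embeddings.
    Returns the sum over query tokens of the maximum cosine similarity
    against any document token -- token-level matching computed after
    each side is encoded independently (no early cross-modal fusion).
    """
    sim = query_emb @ doc_emb.T          # (Nq, Nd) cosine similarities
    return float(sim.max(axis=1).sum())  # best match per query token, summed

# Toy example: 3 query tokens, 5 document tokens, embedding dim 8
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(5, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
score = late_interaction_score(q, d)
```

Because document embeddings are computed independently of the query, they can be precomputed and indexed offline; only the cheap MaxSim reduction runs at query time, which is the efficiency advantage over early-fusion cross-encoders.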
📝 Abstract
Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate utilizes a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late-interaction scoring mechanism better suited to the structure and visual characteristics of multimodal documents. ColMate obtains a 3.61% improvement over existing retrieval models on the ViDoRe V2 benchmark and demonstrates stronger generalization to out-of-domain benchmarks.
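The masked contrastive objective mentioned above can be illustrated with a generic masked cross-modal InfoNCE loss: paired text/image embeddings are aligned on the diagonal of a similarity matrix, and only masked positions contribute to the loss. This is a hedged sketch of the standard formulation, not ColMate's exact loss; the function name, mask semantics, and temperature value are assumptions.

```python
import numpy as np

def masked_contrastive_loss(text_emb, image_emb, mask, tau=0.07):
    """Generic masked cross-modal InfoNCE sketch (illustrative only).

    text_emb, image_emb: (B, d) L2-normalized embeddings of paired
        text / image views; row i of each side is a positive pair.
    mask: (B,) boolean -- only masked positions contribute to the loss,
        mimicking a masked contrastive objective.
    tau: softmax temperature.
    """
    logits = (text_emb @ image_emb.T) / tau                  # (B, B) scaled similarities
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -np.diag(log_probs)        # cross-entropy with positives on the diagonal
    return float(nll[mask].mean())   # average loss over masked positions only
```

With well-aligned pairs (high diagonal similarity), the loss approaches zero; mismatched pairs drive it up, which is what pushes fine-grained text-vision alignment during training.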