🤖 AI Summary
Existing multimodal document retrieval methods naively adapt text-only techniques, neglecting structural and visual features in document encoding, training objectives, and similarity modeling. To address this, we propose ColMate—a novel framework that unifies OCR-aware pretraining, masked cross-modal contrastive learning, and a lightweight late-interaction mechanism. Specifically, OCR-aware pretraining enhances layout and glyph representation; the masked cross-modal contrastive loss improves fine-grained semantic alignment between the text and vision modalities; and the late-interaction scorer enables token-level matching without the overhead of early fusion. Evaluated on the ViDoRe V2 benchmark, ColMate achieves a 3.61% absolute improvement over the prior state of the art. Moreover, it demonstrates superior cross-domain generalization on out-of-distribution datasets. ColMate establishes a new paradigm for multimodal document retrieval that is both structurally aware and cross-modally robust.
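To make the late-interaction scorer concrete: in ColBERT-style late interaction (which the summary's "token-level matching without early fusion" describes), query and document tokens are embedded independently, and the score sums, over query tokens, the maximum similarity against any document token (MaxSim). The sketch below is a generic illustration under that standard formulation, not ColMate's exact implementation; the function name and toy dimensions are illustrative.

```python
import numpy as np

def late_interaction_score(query_emb, doc_emb):
    """ColBERT-style late-interaction (MaxSim) scoring sketch.

    query_emb: (Nq, d) L2-normalized query token embeddings.
    doc_emb:   (Nd, d) L2-normalized document token/patch embeddings.
    Returns the sum over query tokens of the maximum cosine similarity
    against any document token -- token-level matching computed after
    each side is encoded independently (no early cross-modal fusion).
    """
    sim = query_emb @ doc_emb.T          # (Nq, Nd) cosine similarities
    return float(sim.max(axis=1).sum())  # best match per query token, summed

# Toy example: 3 query tokens, 5 document tokens, embedding dim 8
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(5, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
score = late_interaction_score(q, d)
```

Because document embeddings are computed independently of the query, they can be precomputed and indexed offline; only the cheap MaxSim reduction runs at query time, which is the efficiency advantage over early-fusion cross-encoders.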
📝 Abstract
Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate utilizes a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late-interaction scoring mechanism better suited to the structure and visual characteristics of multimodal documents. ColMate obtains a 3.61% improvement over existing retrieval models on the ViDoRe V2 benchmark and demonstrates stronger generalization to out-of-domain benchmarks.
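The masked contrastive objective mentioned above can be illustrated with a generic masked cross-modal InfoNCE loss: paired text/image embeddings are aligned on the diagonal of a similarity matrix, and only masked positions contribute to the loss. This is a hedged sketch of the standard formulation, not ColMate's exact loss; the function name, mask semantics, and temperature value are assumptions.

```python
import numpy as np

def masked_contrastive_loss(text_emb, image_emb, mask, tau=0.07):
    """Generic masked cross-modal InfoNCE sketch (illustrative only).

    text_emb, image_emb: (B, d) L2-normalized embeddings of paired
        text / image views; row i of each side is a positive pair.
    mask: (B,) boolean -- only masked positions contribute to the loss,
        mimicking a masked contrastive objective.
    tau: softmax temperature.
    """
    logits = (text_emb @ image_emb.T) / tau                  # (B, B) scaled similarities
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -np.diag(log_probs)        # cross-entropy with positives on the diagonal
    return float(nll[mask].mean())   # average loss over masked positions only
```

With well-aligned pairs (high diagonal similarity), the loss approaches zero; mismatched pairs drive it up, which is what pushes fine-grained text-vision alignment during training.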