Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval

📅 2025-03-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses text-to-video retrieval (T2VR) with a simple, efficient late-interaction framework for fine-grained similarity assessment between queries and videos. Inspired by late interaction in text-document, text-image, and text-video retrieval, Video-ColBERT combines three components: token-wise interaction at both the spatial and temporal levels, query and visual expansions, and a dual sigmoid loss used during training. Together these yield video representations that are strong individually yet compatible with each other. On common T2VR benchmarks, the method improves over other bi-encoder methods.

📝 Abstract

In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon 3 main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations lead to increases in performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
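To make the idea of token-wise late interaction concrete, here is a minimal sketch of ColBERT-style MaxSim scoring applied to two sets of visual tokens. It is an illustration under stated assumptions, not the paper's implementation: the function names (`max_sim`, `video_score`), the use of cosine similarity, the sum over query tokens, and the simple addition of a spatial (frame-level) score and a temporal score are all assumptions made here for clarity.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def max_sim(query_tokens, visual_tokens):
    """ColBERT-style late interaction: for each query token, take its best
    cosine match among the visual tokens, then sum over query tokens."""
    q = [normalize(t) for t in query_tokens]
    v = [normalize(t) for t in visual_tokens]
    return sum(max(dot(qt, vt) for vt in v) for qt in q)

def video_score(query_tokens, frame_tokens, temporal_tokens):
    """Assumed combination: add a spatial (per-frame) interaction score to a
    temporal interaction score. The paper may weight or pool these differently."""
    return max_sim(query_tokens, frame_tokens) + max_sim(query_tokens, temporal_tokens)

# Toy 2-dimensional embeddings, purely illustrative.
query = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [1.0, 1.0]]
temporal = [[0.0, 2.0]]
score = video_score(query, frames, temporal)
```

Because query and video tokens are encoded independently and only compared at scoring time, video token embeddings can be precomputed once and reused across queries, which is the usual appeal of bi-encoder late interaction.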

Problem

Research questions and friction points this paper is trying to address.

Enhancing text-to-video retrieval via fine-grained similarity assessment

Improving video content encoding with spatial-temporal token interaction

Boosting retrieval performance using dual sigmoid loss and expansions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained spatial and temporal token-wise interaction

Query and visual expansions for enhanced retrieval

Dual sigmoid loss for compatible representations
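The dual sigmoid loss listed above can be sketched as follows. This summary does not give the exact formulation, so the code assumes a pairwise sigmoid contrastive loss (matched query-video pairs labeled +1, all other pairs -1) applied separately to two similarity matrices and then summed. The names `sigmoid_loss` and `dual_sigmoid_loss`, the `temperature` and `bias` parameters, and their default values are hypothetical choices for illustration.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_loss(sim, temperature=10.0, bias=-5.0):
    """Pairwise sigmoid loss over a square similarity matrix.
    sim[i][j] is the score of query i against video j; the diagonal
    holds the matched pairs. temperature and bias are assumed values."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (temperature * sim[i][j] + bias))
    return -total / n

def dual_sigmoid_loss(spatial_sim, temporal_sim, **kwargs):
    """Assumed 'dual' form: one sigmoid loss per similarity matrix, summed,
    so that each representation is trained to be discriminative on its own."""
    return sigmoid_loss(spatial_sim, **kwargs) + sigmoid_loss(temporal_sim, **kwargs)

# Toy similarity matrices: 'aligned' scores matched pairs highest.
aligned = [[1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [1.0, 0.0]]
loss_good = dual_sigmoid_loss(aligned, aligned)
loss_bad = dual_sigmoid_loss(misaligned, misaligned)
```

Applying the loss to each similarity matrix separately, rather than only to their sum, is one plausible way to obtain representations that are strong individually while still combining well at scoring time.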

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs