🤖 AI Summary
Current video retrieval paradigms are constrained by narrow-domain benchmarks and single-task training, and they lack diagnostic evaluation of multi-dimensional generalization. To address this, we propose a unified co-design framework for video retrieval. First, we introduce UVRB, the first diagnostic, general-purpose benchmark, covering 16 diverse datasets; it exposes critical gaps in how existing benchmarks assess cross-domain and cross-task generalization. Second, we design a scalable synthesis pipeline that generates 1.55 million high-quality multimodal video–text pairs. Third, we propose a multimodal pyramid curriculum that trains progressively along a semantic hierarchy, from local to global and simple to complex, yielding the General Video Embedder (GVE) model. Evaluated on UVRB, GVE achieves state-of-the-art zero-shot transfer performance, with substantial gains in cross-domain and cross-task generalization, validating the effectiveness of our integrated co-design of evaluation, data, and model.
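The progressive, simple-to-complex curriculum described above can be pictured as a stage-based data sampler whose mixture shifts with training progress. The sketch below is a minimal illustration under assumed stage names and weighting; it is not the paper's actual Modality Pyramid implementation.

```python
import random

# Hypothetical curriculum stages, ordered from local/simple semantics
# (e.g., frame-caption pairs) to global/complex ones (e.g., long
# video-paragraph pairs). Stage names are illustrative, not from the paper.
STAGES = [
    {"name": "frame-caption", "difficulty": 0},
    {"name": "clip-sentence", "difficulty": 1},
    {"name": "video-paragraph", "difficulty": 2},
]

def stage_weights(progress, stages=STAGES, sharpness=4.0):
    """Weight each stage by training progress in [0, 1].

    Early in training, mass concentrates on simple stages; later it
    shifts toward complex ones, while all stages stay mixed throughout.
    """
    max_d = max(s["difficulty"] for s in stages)
    target = progress * max_d  # the difficulty level currently emphasized
    weights = [1.0 / (1.0 + sharpness * abs(s["difficulty"] - target))
               for s in stages]
    total = sum(weights)
    return [w / total for w in weights]

def sample_stage(progress, rng=random):
    """Pick the curriculum stage for the next training batch."""
    w = stage_weights(progress)
    return rng.choices(STAGES, weights=w, k=1)[0]["name"]
```

At `progress=0.0` most sampling mass falls on the simplest stage, and at `progress=1.0` on the most complex, giving the gradual local-to-global shift the summary describes.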
📝 Abstract
The prevailing video retrieval paradigm is structurally misaligned: narrow benchmarks incentivize correspondingly limited data and single-task training. As a result, universal capability is suppressed by the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show that GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant yet overlooked scenario. Overall, our co-designed framework offers a practical path out of this limited scope and toward truly universal video retrieval.