🤖 AI Summary
Current video retrieval paradigms are constrained by narrow-domain benchmarks and single-task training, and they lack diagnostic evaluation of multi-dimensional generalization. To address this, we propose a unified co-design framework for video retrieval. First, we introduce UVRB, the first diagnostic, general-purpose benchmark, covering 16 diverse datasets; it exposes critical gaps in how existing benchmarks assess cross-domain and cross-task generalization. Second, we design a scalable synthesis pipeline that generates 1.55 million high-quality multimodal video–text pairs. Third, we propose a multimodal pyramid curriculum that trains progressively along a semantic hierarchy, from local to global and simple to complex, yielding the General Video Embedder (GVE) model. Evaluated on UVRB, GVE achieves state-of-the-art zero-shot transfer performance, with substantial gains in cross-domain and cross-task generalization, validating the effectiveness of our integrated co-design of evaluation, data, and model.
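The progressive, simple-to-complex curriculum described above can be pictured as a stage-based data sampler whose mixture shifts with training progress. The sketch below is a minimal illustration under assumed stage names and weighting; it is not the paper's actual Modality Pyramid implementation.

```python
import random

# Hypothetical curriculum stages, ordered from local/simple semantics
# (e.g., frame-caption pairs) to global/complex ones (e.g., long
# video-paragraph pairs). Stage names are illustrative, not from the paper.
STAGES = [
    {"name": "frame-caption", "difficulty": 0},
    {"name": "clip-sentence", "difficulty": 1},
    {"name": "video-paragraph", "difficulty": 2},
]

def stage_weights(progress, stages=STAGES, sharpness=4.0):
    """Weight each stage by training progress in [0, 1].

    Early in training, mass concentrates on simple stages; later it
    shifts toward complex ones, while all stages stay mixed throughout.
    """
    max_d = max(s["difficulty"] for s in stages)
    target = progress * max_d  # the difficulty level currently emphasized
    weights = [1.0 / (1.0 + sharpness * abs(s["difficulty"] - target))
               for s in stages]
    total = sum(weights)
    return [w / total for w in weights]

def sample_stage(progress, rng=random):
    """Pick the curriculum stage for the next training batch."""
    w = stage_weights(progress)
    return rng.choices(STAGES, weights=w, k=1)[0]["name"]
```

At `progress=0.0` most sampling mass falls on the simplest stage, and at `progress=1.0` on the most complex, giving the gradual local-to-global shift the summary describes.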
📝 Abstract
The prevailing video retrieval paradigm is structurally misaligned: narrow benchmarks incentivize correspondingly limited data and single-task training. As a result, universal capability is suppressed by the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show that GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant yet overlooked scenario. Overall, our co-designed framework offers a practical path out of this limited scope and toward truly universal video retrieval.