AI Summary
This work addresses the semantic asymmetry between concise text queries and rich video content in partially relevant video retrieval, which leads to semantic ambiguity across videos and sparse temporal supervision within them. To tackle this, the paper introduces Holmes, a hierarchical evidential learning framework that, for the first time, incorporates explicit uncertainty modeling into this task. At the inter-video level, it treats cross-video similarity scores as evidential support and models them with a Dirichlet distribution; at the intra-video level, it achieves soft query-clip alignment via flexible optimal transport with an adaptive dustbin. Following a three-fold principle, Holmes performs fine-grained query identification, which in turn guides query-adaptive calibrated learning, while aggregating multi-granular cross-modal evidence. Experiments show that Holmes outperforms existing methods across multiple benchmarks.
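To make the Dirichlet interpretation concrete, here is a minimal sketch of how similarity scores could be read as evidence. It follows the standard subjective-logic recipe from evidential deep learning (evidence e_k, concentration alpha_k = e_k + 1, uncertainty u = K / sum(alpha)), not necessarily the formulation Holmes uses. The function name `dirichlet_uncertainty`, the softplus mapping from similarity to evidence, and the `scale` parameter are all assumptions made for illustration.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)); always non-negative.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dirichlet_uncertainty(similarities, scale=10.0):
    """Turn query-to-video similarities into Dirichlet beliefs and uncertainty.

    Hypothetical helper: the softplus evidence mapping and `scale` are
    illustrative assumptions, not taken from the paper.
    """
    evidence = [softplus(scale * s) for s in similarities]  # e_k >= 0
    alpha = [e + 1.0 for e in evidence]                     # Dirichlet parameters
    strength = sum(alpha)                                   # S = sum_k alpha_k
    belief = [e / strength for e in evidence]               # b_k = e_k / S
    uncertainty = len(similarities) / strength              # u = K / S
    return belief, uncertainty


# One query scored against four candidate videos.
belief, u = dirichlet_uncertainty([0.9, 0.1, 0.0, -0.1])
_, u_vague = dirichlet_uncertainty([0.1, 0.1, 0.1, 0.1])
print(belief, u, u_vague)
```

In this formulation the belief masses and the uncertainty always sum to one, and the uncertainty shrinks as the total evidence grows, so a query with one strong match ends up less uncertain than a query with uniformly weak matches.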
Abstract
Partially relevant video retrieval aims to retrieve untrimmed videos using text queries that describe only partial content. However, the inherent asymmetry between brief queries and rich video content inevitably introduces uncertainty into the retrieval process. In this setting, vague queries often induce semantic ambiguity across videos, a challenge that is further exacerbated by the sparse temporal supervision within videos, which fails to provide sufficient matching evidence. To address this, we propose Holmes, a hierarchical evidential learning framework that aggregates multi-granular cross-modal evidence to quantify and model uncertainty explicitly. At the inter-video level, similarity scores are interpreted as evidential support and modeled via a Dirichlet distribution. Based on the proposed three-fold principle, we perform fine-grained query identification, which then guides query-adaptive calibrated learning. At the intra-video level, to accumulate denser evidence, we formulate a soft query-clip alignment via flexible optimal transport with an adaptive dustbin, which alleviates sparse temporal supervision while suppressing spurious local responses. Extensive experiments demonstrate that Holmes outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/ICML26-Holmes.
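As a rough illustration of optimal transport with a dustbin, the sketch below runs entropic Sinkhorn iterations on a score matrix augmented with one extra row and one extra column, so that unmatched items have somewhere to send their mass. This is a generic construction, not the Holmes implementation: the function name `sinkhorn_with_dustbin`, the choice of query tokens as rows and clips as columns, the marginals, and the fixed scalar `dustbin_score` (which the paper makes adaptive in a way not described here) are all assumptions.

```python
import math


def sinkhorn_with_dustbin(scores, dustbin_score, eps=0.5, n_iters=100):
    """Soft alignment between m query tokens and n clips with a dustbin.

    Hypothetical sketch: `dustbin_score` is a fixed scalar here, whereas an
    adaptive dustbin would presumably be learned or query-dependent.
    """
    m, n = len(scores), len(scores[0])
    # Augment the score matrix with a dustbin column and a dustbin row.
    aug = [list(row) + [dustbin_score] for row in scores]
    aug.append([dustbin_score] * (n + 1))
    kernel = [[math.exp(s / eps) for s in row] for row in aug]
    # Each real item carries unit mass; each dustbin can absorb
    # all the mass coming from the other side.
    row_mass = [1.0] * m + [float(n)]
    col_mass = [1.0] * n + [float(m)]
    u = [1.0] * (m + 1)
    v = [1.0] * (n + 1)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(n + 1))
             for i in range(m + 1)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(m + 1))
             for j in range(n + 1)]
    # Transport plan P = diag(u) K diag(v).
    return [[u[i] * kernel[i][j] * v[j] for j in range(n + 1)]
            for i in range(m + 1)]


# Two query tokens, three clips; the third clip matches neither token well.
scores = [[0.9, 0.2, 0.1],
          [0.1, 0.8, 0.0]]
plan = sinkhorn_with_dustbin(scores, dustbin_score=0.3)
for row in plan:
    print([round(p, 3) for p in row])
```

The last row of the plan holds the mass each clip sends to the dustbin. In this toy example the third clip, which scores lowest against both tokens, sends more of its mass there than the other two clips do, which is the sense in which a dustbin can suppress spurious local responses.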