🤖 AI Summary
In video moment retrieval, conventional methods suffer from inadequate fine-grained semantic-temporal alignment and biased uncertainty modeling: pretrained models struggle with complex or ambiguous queries, and uncertainty estimation is systematically skewed, assigning high uncertainty to easy rather than hard samples. To address these limitations, we propose DEMR, a debiased evidential learning framework built on Deep Evidential Regression (DER). DEMR introduces three key components: (1) a Reflective Flipped Fusion (RFF) block for cross-modal alignment, (2) a query reconstruction task to strengthen semantic grounding, and (3) a Geom-regularizer to refine uncertainty predictions. Together these yield robust semantic-temporal alignment and debiased uncertainty quantification; experiments demonstrate significant improvements in retrieval accuracy, robustness, and interpretability on standard benchmarks and the debiased ActivityNet-CD and Charades-CD datasets.
📝 Abstract
In the domain of moment retrieval, accurately identifying temporal segments within videos based on natural language queries remains challenging. Traditional methods often rely on pre-trained models that struggle with fine-grained information and deterministic reasoning, making it difficult to align with complex or ambiguous moments. To overcome these limitations, we explore Deep Evidential Regression (DER) and construct a vanilla evidential baseline. However, this approach suffers from two major issues: an inability to handle modality imbalance effectively, and structural deficiencies in DER's heuristic uncertainty regularizer, both of which degrade uncertainty estimation. The resulting misalignment causes high uncertainty to be incorrectly assigned to accurate samples rather than challenging ones. Our observations indicate that existing methods lack the adaptability required for complex video scenarios. In response, we propose Debiased Evidential Learning for Moment Retrieval (DEMR), a novel framework that incorporates a Reflective Flipped Fusion (RFF) block for cross-modal alignment and a query reconstruction task to enhance text sensitivity, thereby reducing bias in uncertainty estimation. Additionally, we introduce a Geom-regularizer to refine uncertainty predictions, enabling adaptive alignment with difficult moments and improving retrieval accuracy. Extensive experiments on standard datasets and on the debiased ActivityNet-CD and Charades-CD benchmarks demonstrate significant gains in effectiveness, robustness, and interpretability, positioning our approach as a promising solution for temporal-semantic robustness in moment retrieval. The code is publicly available at https://github.com/KaijingOfficial/DEMR.
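For readers unfamiliar with the vanilla DER baseline the abstract critiques, the sketch below shows the standard Normal-Inverse-Gamma (NIG) formulation from Deep Evidential Regression: how aleatoric and epistemic uncertainty are decomposed from the NIG parameters (γ, ν, α, β), and the heuristic evidence regularizer |y − γ|·(2ν + α) whose behavior DEMR's Geom-regularizer is designed to correct. This is a minimal illustration of generic DER, not DEMR's actual implementation; function names are illustrative.

```python
import torch

def nig_uncertainties(gamma, nu, alpha, beta):
    """Decompose uncertainty from Normal-Inverse-Gamma outputs (standard DER).

    Prediction:          E[mu]      = gamma
    Aleatoric (data):    E[sigma^2] = beta / (alpha - 1)
    Epistemic (model):   Var[mu]    = beta / (nu * (alpha - 1))
    """
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (nu * (alpha - 1.0))
    return aleatoric, epistemic

def der_evidence_regularizer(y, gamma, nu, alpha):
    """Heuristic DER regularizer: scale the prediction error by the
    total evidence (2*nu + alpha), penalizing confident wrong outputs."""
    return torch.abs(y - gamma) * (2.0 * nu + alpha)

# Toy example: one prediction with NIG parameters.
gamma = torch.tensor([0.0])   # predicted moment boundary (normalized)
nu    = torch.tensor([1.0])
alpha = torch.tensor([2.0])
beta  = torch.tensor([1.0])
y     = torch.tensor([2.0])   # ground-truth boundary

aleatoric, epistemic = nig_uncertainties(gamma, nu, alpha, beta)
reg = der_evidence_regularizer(y, gamma, nu, alpha)
print(aleatoric.item(), epistemic.item(), reg.item())  # 1.0 1.0 8.0
```

Because the penalty grows with evidence only through the error magnitude, well-fit (easy) samples can still accumulate distorted uncertainty, which is the bias the paper targets.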