🤖 AI Summary
Weakly supervised temporal sentence grounding faces the challenge of lacking explicit temporal boundary annotations. Existing Gaussian-based proposal methods suffer from two key limitations: rigid boundary generation and the neglect of proposal quality in top-1 prediction. To address these, this paper proposes dynamic multi-Gaussian fusion for boundary prediction, decoupling boundary generation from optimal proposal selection during inference. Furthermore, we design a quality-aware selection mechanism that weights proposals based on both confidence scores and overlap consistency—enabling plug-and-play optimization without retraining. Notably, our approach is the first to jointly model boundary diversity and proposal quality differences purely at inference time. Experiments demonstrate consistent improvements of 2.3–3.7 percentage points in mean Average Precision (mAP) on ActivityNet Captions and Charades-STA, significantly outperforming existing weakly supervised methods while incurring no additional training overhead.
📝 Abstract
Weakly supervised video grounding aims to localize the temporal segment relevant to a given query without ground-truth temporal boundary annotations. While existing methods primarily use Gaussian-based proposals, they overlook the importance of (1) boundary prediction and (2) top-1 prediction selection during inference. For boundary prediction, boundaries are simply set half a standard deviation away from the Gaussian mean on both sides, which may not accurately capture the optimal boundaries. For top-1 prediction, existing methods rely heavily on intersections with other proposals, without considering the varying quality of each proposal. To address these issues, we explore various inference strategies by introducing (1) novel boundary prediction methods that capture diverse boundaries from multiple Gaussians and (2) new selection methods that take proposal quality into account. Extensive experiments on the ActivityNet Captions and Charades-STA datasets validate the effectiveness of our inference strategies, demonstrating performance improvements without requiring additional training.
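The two inference-time ideas can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's actual formulation: the fusion rule (taking the span covered by all per-Gaussian boundaries) and the quality score (confidence times mean IoU with the other proposals) are assumptions made for clarity.

```python
def gaussian_boundary(mu, sigma, width=0.5):
    """Baseline rule from the abstract: boundaries half a std from the mean."""
    return max(0.0, mu - width * sigma), min(1.0, mu + width * sigma)

def fused_boundary(gaussians):
    """Multi-Gaussian fusion (assumed rule): cover the union of the
    per-Gaussian spans to capture more diverse boundaries."""
    spans = [gaussian_boundary(mu, sigma) for mu, sigma in gaussians]
    return min(s for s, _ in spans), max(e for _, e in spans)

def iou(a, b):
    """Temporal IoU of two (start, end) spans in normalized time."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def select_top1(proposals):
    """Quality-aware top-1 (assumed score): weight each proposal's
    confidence by its mean overlap with the other proposals, instead
    of using overlap votes alone."""
    best_span, best_score = None, -1.0
    for i, (span, conf) in enumerate(proposals):
        others = [p for j, (p, _) in enumerate(proposals) if j != i]
        overlap = sum(iou(span, o) for o in others) / max(len(others), 1)
        score = conf * overlap
        if score > best_score:
            best_span, best_score = span, score
    return best_span
```

For example, `fused_boundary([(0.3, 0.2), (0.5, 0.2)])` widens the prediction to `(0.2, 0.6)` rather than committing to either single Gaussian, and `select_top1` prefers a confident proposal that other proposals also agree with.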