An empirical study of the effect of video encoders on Temporal Video Grounding

📅 2025-10-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior work on temporal video grounding lacks a systematic evaluation of video encoders, leading to architecture-specific overfitting. Method: This paper conducts the first end-to-end comparative study of three dominant video encoder paradigms (CNNs, temporal reasoning models such as RNNs, and Transformers) across three standard benchmarks (Charades-STA, ActivityNet-Captions, YouCookII), integrating each uniformly into a canonical grounding architecture. Contribution/Results: The study reveals complementary feature representations across encoder types and identifies consistent, model-specific localization error patterns. Empirical results demonstrate that encoder choice significantly affects grounding accuracy, with distinct error distributions tied to architectural inductive biases. The findings provide empirical grounding for video representation design and suggest an encoder-diversity principle to mitigate architectural overfitting, thereby improving robustness and generalization in temporal video grounding.

📝 Abstract
Temporal video grounding is a fundamental task in computer vision, aiming to localize a natural language query in a long, untrimmed video. It plays a key role in the scientific community, in part due to the large amount of video generated every day. Although there is extensive work on this task, research remains focused on a small selection of video representations, which may lead to architectural overfitting in the long run. To address this issue, we conduct an empirical study investigating the impact of different video features on a classical architecture. We extract features for three well-known benchmarks (Charades-STA, ActivityNet-Captions, and YouCookII) using video encoders based on CNNs, temporal reasoning, and transformers. Our results show significant differences in model performance from simply changing the video encoder, while also revealing clear patterns and errors derived from the use of certain features, ultimately indicating potential feature complementarity.
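For readers unfamiliar with how "localizing a query in a video" is scored, the standard metric in this task is temporal Intersection-over-Union between a predicted (start, end) segment and the ground-truth segment, typically reported as Recall@1 at fixed IoU thresholds. The sketch below is a generic illustration of that metric, not code from the paper; the function names are our own.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, threshold=0.5):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)

# A prediction of [0s, 10s] against a ground truth of [5s, 15s]
# overlaps for 5s out of a 15s union, so IoU = 1/3.
print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```

Because the paper reports model-specific error patterns, metrics like this, swept over several IoU thresholds, are what make those localization differences visible.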
Problem

Research questions and friction points this paper is trying to address.

Investigating video encoder impact on temporal video grounding performance
Exploring feature complementarity across CNN, temporal reasoning, and transformer encoders
Analyzing performance variations across three video grounding benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Empirical study of video encoder effects
Extract features using CNN, temporal reasoning, and transformer encoders
Analyze performance differences and feature complementarity
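The core experimental design is to hold the grounding architecture fixed and swap only the video encoder. A minimal sketch of that pattern is below; the encoder functions and the grounding head are toy stand-ins (the paper's actual backbones and grounding model are not reproduced here), but the interface, any encoder mapping T frames to a (T, D) feature matrix, is the point.

```python
import numpy as np

# Toy stand-in encoders; real ones would be pretrained CNN / transformer backbones.
def cnn_encoder(frames):
    """Per-clip features, e.g. from a CNN backbone (illustrative)."""
    return frames.reshape(len(frames), -1)

def transformer_encoder(frames):
    """Same shape of output, but with toy global-context mixing (illustrative)."""
    feats = frames.reshape(len(frames), -1)
    return feats - feats.mean(axis=0)

ENCODERS = {"cnn": cnn_encoder, "transformer": transformer_encoder}

def ground_query(features, query_emb):
    """Toy grounding head: score each clip feature against the query
    and return the best-scoring single-clip window [start, end)."""
    scores = features @ query_emb[: features.shape[1]]
    t = int(scores.argmax())
    return (t, t + 1)

rng = np.random.default_rng(0)
frames = rng.random((16, 8, 8, 3))   # T=16 tiny "frames"
query = rng.random(512)              # toy query embedding
for name, encoder in ENCODERS.items():
    feats = encoder(frames)          # (T, D), same interface for every encoder
    print(name, ground_query(feats, query))
```

Keeping the grounding head identical across encoders is what allows performance differences, and the error patterns the paper reports, to be attributed to the encoder rather than to the rest of the architecture.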