Moment Quantization for Video Temporal Grounding

📅 2025-04-03
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address the weak discriminability of continuous features in temporal video grounding—particularly the difficulty in distinguishing relevant from irrelevant moments—this paper proposes a discrete moment quantization framework. Methodologically, it introduces (1) a learnable discrete codebook coupled with a lossless, clustering-based soft matching mechanism, replacing conventional hard quantization; (2) prior-guided codebook initialization and joint projection to enhance codebook representational quality; and (3) a plug-and-play architectural design for seamless integration. The approach achieves significant improvements over state-of-the-art methods on six mainstream benchmarks. Qualitative analysis demonstrates its ability to effectively aggregate semantically relevant segments while suppressing irrelevant temporal instances, thereby substantially enhancing discriminative capability for temporal localization.
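
The "lossless, clustering-based soft matching" is not spelled out above; a plausible formalization, assuming it follows the standard soft-assignment pattern (the symbols v_i, c_k, and the temperature τ are ours, not the paper's), is:

```latex
% Hard quantization (what the summary says MQVTG avoids):
% each moment feature v_i is snapped to its single nearest codeword.
\hat{v}_i = c_{k^\ast}, \qquad k^\ast = \arg\max_{k} \operatorname{sim}(v_i, c_k)

% Clustering-based soft matching: every codeword c_k contributes,
% weighted by a softmax over similarities, so no information is
% discarded by a hard argmax.
a_{ik} = \frac{\exp\!\big(\operatorname{sim}(v_i, c_k)/\tau\big)}
              {\sum_{j=1}^{K} \exp\!\big(\operatorname{sim}(v_i, c_j)/\tau\big)},
\qquad
\hat{v}_i = \sum_{k=1}^{K} a_{ik}\, c_k
```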

📝 Abstract
Video temporal grounding is a critical video understanding task that aims to localize moments relevant to a language description. The challenge of this task lies in distinguishing relevant from irrelevant moments. Previous methods, which focus on learning continuous features, exhibit weak differentiation between foreground and background features. In this paper, we propose a novel Moment-Quantization based Video Temporal Grounding method (MQVTG), which quantizes the input video into various discrete vectors to enhance the discrimination between relevant and irrelevant moments. Specifically, MQVTG maintains a learnable moment codebook, where each video moment matches a codeword. Considering visual diversity, i.e., the various visual expressions of the same moment, MQVTG treats moment-codeword matching as a clustering process rather than selecting a single discrete vector, avoiding the information loss of direct hard quantization. Additionally, we employ effective prior-initialization and joint-projection strategies to enhance the maintained moment codebook. With its simple implementation, the proposed method can be integrated into existing temporal grounding models as a plug-and-play component. Extensive experiments on six popular benchmarks demonstrate the effectiveness and generalizability of MQVTG, which significantly outperforms state-of-the-art methods. Further qualitative analysis shows that our method effectively groups relevant features and separates irrelevant ones, aligning with our goal of enhancing discrimination.
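
Since the abstract describes the method as a plug-and-play component, here is a minimal PyTorch sketch of how such a soft moment-quantization layer could look; the class name `MomentQuantizer`, the cosine-similarity matching, the residual connection, and all hyperparameters are our assumptions, not the paper's published code.

```python
# Hedged sketch of a soft moment-quantization layer; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MomentQuantizer(nn.Module):
    """Maps per-moment video features onto a learnable codebook via soft
    (clustering-style) assignment instead of hard nearest-neighbor
    quantization, so no feature information is discarded."""

    def __init__(self, dim: int, num_codewords: int = 256, tau: float = 0.1):
        super().__init__()
        # Learnable moment codebook: one d-dim vector per codeword.
        self.codebook = nn.Parameter(torch.randn(num_codewords, dim) * 0.02)
        self.tau = tau
        # One reading of "joint projection": map moments and codewords
        # into a shared space with the same linear layer.
        self.proj = nn.Linear(dim, dim)

    def forward(self, moments: torch.Tensor) -> torch.Tensor:
        # moments: (batch, num_moments, dim)
        m = F.normalize(self.proj(moments), dim=-1)
        c = F.normalize(self.proj(self.codebook), dim=-1)
        # Cosine similarity between every moment and every codeword.
        sim = torch.einsum("bnd,kd->bnk", m, c)           # (B, N, K)
        assign = F.softmax(sim / self.tau, dim=-1)        # soft assignment
        quantized = torch.einsum("bnk,kd->bnd", assign, self.codebook)
        # A residual connection is one way to keep the layer plug-and-play:
        # it can be inserted between an existing video encoder and grounding
        # head without destroying the encoder's features.
        return moments + quantized

# Usage: drop it between the video encoder and the grounding head.
feats = torch.randn(2, 75, 512)       # e.g. 75 clip-level features
quantizer = MomentQuantizer(dim=512)
enhanced = quantizer(feats)           # same shape: (2, 75, 512)
```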
Problem

Research questions and friction points this paper is trying to address.

Enhance discrimination between relevant and irrelevant video moments
Quantize video into discrete vectors for better temporal grounding
Improve moment-codeword matching without losing useful information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantizes video into discrete vectors for better discrimination
Uses learnable moment codebook for moment-codeword matching
Employs prior-initialization and joint-projection strategies to enhance the codebook (a hedged initialization sketch follows this list)
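
The "prior-initialization" bullet leaves the prior unspecified; one common realization, sketched below under that assumption, seeds the codebook with k-means centroids of training moment features. The helper name `prior_init_codebook` and the use of scikit-learn are ours, purely for illustration.

```python
# Hedged sketch: initialize the moment codebook from a clustering prior.
# Assumes precomputed training moment features; the paper's actual prior
# (and its hyperparameters) may differ.
import numpy as np
import torch
from sklearn.cluster import KMeans

def prior_init_codebook(moment_feats: np.ndarray, num_codewords: int = 256) -> torch.Tensor:
    """Cluster training moment features and return the centroids as the
    initial codebook, so codewords start near real moment modes instead
    of random noise."""
    kmeans = KMeans(n_clusters=num_codewords, n_init=10, random_state=0)
    kmeans.fit(moment_feats)                      # (num_samples, dim)
    return torch.from_numpy(kmeans.cluster_centers_).float()

# Usage: overwrite a randomly initialized codebook before training.
feats = np.random.randn(10_000, 512).astype("float32")
init = prior_init_codebook(feats, num_codewords=256)   # (256, 512)
# quantizer.codebook.data.copy_(init)   # hypothetical MomentQuantizer from above
```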
👥 Authors
Xiaolong Sun
Xi'an Jiaotong University
multimodal learning
Le Wang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Sanping Zhou
Xi'an Jiaotong University
Computer Vision, Machine Learning
Liushuai Shi
Xi'an Jiaotong University
Deep Learning, Autonomous Driving
Kun Xia
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Mengnan Liu
Xi'an Jiaotong University
Video Understanding
Yabing Wang
Xi’an Jiaotong University
multimodal learning
Gang Hua
Director of Applied Science, AI, Amazon.com, Inc., IEEE & IAPR Fellow
Computer Vision, Machine Learning, Artificial Intelligence, Robotics, Multimedia