Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding

📅 2024-03-21
🏛️ IEEE Transactions on Pattern Analysis and Machine Intelligence
📈 Citations: 2
Influential: 0

🤖 AI Summary
This work addresses the insufficient synergy between static semantic modeling and temporal dynamic perception in cross-modal semantic alignment for Natural Language Video Grounding (NLVG) and Spoken Language Video Grounding (SLVG). To this end, we propose UniSDNet, which has three key components: (1) a residual MLP structure (ResMLP) that enhances global semantic interaction between video segments and queries; (2) a diffusely connected video clip graph built on 2D sparse temporal masking to reflect "short-term effect" relationships between clips; and (3) a multi-kernel Temporal Gaussian Filter, inspired by persistent activity in human visual perception, that expands temporal distance and relevance cues into a high-dimensional space and filters neighbouring clip nodes during message passing. Our method achieves state-of-the-art performance on six NLVG and SLVG benchmarks (e.g., 38.88% R@1, IoU@0.7 on ActivityNet Captions) and runs inference 1.56× faster than the strong multi-query benchmark. Additionally, we collect two new SLVG datasets, Charades-STA Speech and TACoS Speech, to support future research.
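The result quoted above is reported as R@1 at a temporal IoU threshold. As a minimal sketch of how this kind of metric is commonly computed (the function names `temporal_iou` and `recall_at_1` and the sample intervals are illustrative assumptions, not taken from the paper's code):

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold):
    """Fraction of queries whose top-ranked moment reaches the IoU threshold."""
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_preds, gts)
    )
    return hits / len(gts)


# Hypothetical top-1 predictions and ground-truth moments, in seconds.
predictions = [(0.0, 10.0), (2.0, 8.0)]
ground_truth = [(0.0, 9.0), (5.0, 15.0)]
score = recall_at_1(predictions, ground_truth, iou_threshold=0.7)
```

Under this reading, "38.88% R@1, IoU@0.7" means that for 38.88% of queries the single highest-ranked moment overlaps the annotated moment with IoU of at least 0.7.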

📝 Abstract
Inspired by the activity-silent and persistent activity mechanisms in the biology of human visual perception, we design a Unified Static and Dynamic Network (UniSDNet) that learns the semantic association between video and text/audio queries in a cross-modal environment for efficient video grounding. For static modeling, we devise a novel residual structure (ResMLP) to boost the global, comprehensive interaction between video segments and queries, achieving more effective semantic enhancement and supplementation. For dynamic modeling, we exploit three characteristics of the persistent activity mechanism in our network design for better video context comprehension. Specifically, we construct a diffusely connected video clip graph on the basis of 2D sparse temporal masking to reflect the "short-term effect" relationship. We treat temporal distance and relevance as joint "auxiliary evidence clues" and design a multi-kernel Temporal Gaussian Filter that expands these context clues into a high-dimensional space, simulating "complex visual perception". We then apply element-level filtering convolution to neighbouring clip nodes in the message passing stage, and finally generate and rank the candidate proposals. UniSDNet is applicable to both Natural Language Video Grounding (NLVG) and Spoken Language Video Grounding (SLVG) tasks. It achieves SOTA performance on three widely used NLVG datasets and three SLVG datasets, e.g., setting new records of 38.88% $R@1,IoU@0.7$ on ActivityNet Captions and 40.26% $R@1,IoU@0.5$ on TACoS. To facilitate research in this field, we collect two new datasets (Charades-STA Speech and TACoS Speech) for the SLVG task. In addition, UniSDNet's inference is 1.56× faster than the strong multi-query benchmark. Code is available at: https://github.com/xian-sh/UniSDNet.
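The abstract does not give the exact form of the multi-kernel Temporal Gaussian Filter, so the following is only a rough sketch of the general idea: a scalar temporal distance between two clips is expanded into a vector of Gaussian kernel responses, and that vector scales a neighbour's features element by element during message passing. The kernel centers and widths, the function names, and the assumption that the feature dimension equals the number of kernels are all illustrative choices, not the authors' implementation.

```python
import math


def gaussian_expand(distance, centers, widths):
    """Expand a scalar temporal distance into one response per Gaussian kernel."""
    return [
        math.exp(-((distance - c) ** 2) / (2.0 * w ** 2))
        for c, w in zip(centers, widths)
    ]


def filter_messages(features, edges, centers, widths):
    """Aggregate neighbour features, scaled element-wise by kernel responses.

    features: one K-dimensional list per clip, K = number of kernels (assumed).
    edges: (receiver, sender) clip-index pairs kept by a sparse temporal mask.
    """
    k = len(centers)
    out = [[0.0] * k for _ in features]
    for receiver, sender in edges:
        # Temporal distance is taken as the gap between clip indices (assumed).
        weights = gaussian_expand(abs(receiver - sender), centers, widths)
        for d in range(k):
            out[receiver][d] += weights[d] * features[sender][d]
    return out


# Hypothetical toy input: three clips, two kernels, a sparse set of edges.
clip_features = [[1.0, 1.0], [2.0, 2.0], [4.0, 4.0]]
sparse_edges = [(0, 1), (0, 2), (1, 2)]
messages = filter_messages(clip_features, sparse_edges, [1.0, 2.0], [1.0, 1.0])
```

In this sketch each kernel responds most strongly to neighbours at a different temporal distance, so different feature channels attend to different temporal ranges. The relevance cue that the abstract pairs with temporal distance is omitted here for brevity.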
Problem

Research questions and friction points this paper is trying to address.

Efficient video grounding via static and dynamic modeling
Cross-modal semantic association between video and text/audio
Improving video context comprehension with biological mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Static and Dynamic Network for video grounding
ResMLP residual structure enhances global semantic interaction
Multi-kernel Temporal Gaussian Filter simulates visual perception
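The page does not describe the internals of ResMLP (layer count, normalization, or how video and query tokens are mixed). As a generic illustration of a residual MLP block only, the sketch below computes `x + W2 · relu(W1 · x + b1) + b2` on a single feature vector; the helper names and the toy weights are assumptions for illustration and should not be read as the paper's architecture.

```python
def linear(x, weight, bias):
    """Dense layer: one output per weight row, plus bias."""
    return [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def residual_mlp_block(x, w1, b1, w2, b2):
    """Two-layer MLP whose output is added back to its input (skip connection)."""
    hidden = relu(linear(x, w1, b1))
    update = linear(hidden, w2, b2)
    return [xi + ui for xi, ui in zip(x, update)]


# Hypothetical toy weights: identity matrices and zero biases.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
y = residual_mlp_block([1.0, -2.0], identity, zeros, identity, zeros)
```

The skip connection means the block only has to learn a correction to its input, which is the usual motivation for residual designs.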
Jingjing Hu
Key Laboratory of Knowledge Engineering with Big Data (HFUT), Ministry of Education, School of Computer Science and Information Engineering (School of Artificial Intelligence), Hefei University of Technology (HFUT), and Intelligent Interconnected Systems Laboratory of Anhui Province (HFUT), Hefei, 230601, China
Dan Guo
IEEE senior member, Professor, Hefei University of Technology
Multimedia Computing, Artificial Intelligence
Kun Li
Key Laboratory of Knowledge Engineering with Big Data (HFUT), Ministry of Education, School of Computer Science and Information Engineering (School of Artificial Intelligence), Hefei University of Technology (HFUT), and Intelligent Interconnected Systems Laboratory of Anhui Province (HFUT), Hefei, 230601, China
Zhan Si
Department of Chemistry and Centre for Atomic Engineering of Advanced Materials, Anhui University, Hefei, Anhui 230601, P.R. China
Xun Yang
Department of Electronic Engineering and Information Science, School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China
Xiaojun Chang
Faculty of Engineering and Information Technology, University of Science and Technology of China, Hefei 230026, China
Meng Wang
Key Laboratory of Knowledge Engineering with Big Data (HFUT), Ministry of Education, School of Computer Science and Information Engineering (School of Artificial Intelligence), Hefei University of Technology (HFUT), and Intelligent Interconnected Systems Laboratory of Anhui Province (HFUT), Hefei, 230601, China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, 230026, China