Spiking Variational Graph Representation Inference for Video Summarization

📅 2025-08-21
🤖 AI Summary
Existing video summarization methods struggle to model long-range temporal dependencies and to preserve semantic coherence, and their multi-channel feature fusion is often degraded by noise. To address these challenges, we propose the Spiking Variational Graph Network (SpiVG), the first framework to incorporate spiking neural networks (SNNs) into keyframe selection. A dynamic aggregation graph reasoner decouples contextual object consistency from semantic coherence, and a variational reconstruction module optimizes the evidence lower bound (ELBO) with posterior regularization to suppress feature noise and overfitting. Evaluated on four benchmarks (SumMe, TVSum, VideoXum, and QFVS), SpiVG outperforms existing methods while keeping computational cost low. The approach targets lightweight, robust summarization of short videos.

📝 Abstract
With the rise of short video content, efficient video summarization techniques for extracting key information have become crucial. However, existing methods struggle to capture global temporal dependencies and to maintain the semantic coherence of video content. They are also affected by noise during multi-channel feature fusion. We propose the Spiking Variational Graph (SpiVG) Network, which enhances information density and reduces computational complexity. First, we design a keyframe extractor based on Spiking Neural Networks (SNNs), leveraging their event-driven computation mechanism to learn keyframe features autonomously. Second, to enable fine-grained and adaptable reasoning across video frames, we introduce a Dynamic Aggregation Graph Reasoner, which decouples contextual object consistency from semantic perspective coherence. Third, we present a Variational Inference Reconstruction Module to address the uncertainty and noise that arise during multi-channel feature fusion. In this module, we optimize the evidence lower bound (ELBO) to capture the latent structure of multi-channel feature distributions, and use posterior distribution regularization to reduce overfitting. Experimental results show that SpiVG surpasses existing methods on multiple datasets, including SumMe, TVSum, VideoXum, and QFVS. Our code and pre-trained models are available at https://github.com/liwrui/SpiVG.
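
For readers unfamiliar with the ELBO mentioned above, the textbook form of the bound is sketched here. This is the generic variational-inference objective, not the exact loss used by SpiVG, which this page does not give. The symbols are assumptions introduced for illustration: x stands for a fused multi-channel feature, z for a latent variable, q_phi for the approximate posterior, and p_theta for the generative model.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
\;-\; D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

The first term rewards accurate reconstruction of the features, while the KL term pulls the posterior toward the prior p(z); a regularizer of this kind is one common way to discourage overfitting to noisy inputs.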
Problem

Research questions and friction points this paper is trying to address.

Capturing global temporal dependencies in video summarization
Maintaining semantic coherence across video frames
Reducing noise during multi-channel feature fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spiking Neural Networks for keyframe extraction
Dynamic Aggregation Graph for contextual reasoning
Variational Inference with ELBO optimization
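
To give a feel for the event-driven behaviour that the first item relies on, here is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron run over a sequence of per-frame scores. It is not the SpiVG keyframe extractor: the function name `lif_spikes`, the parameters `tau`, `v_threshold`, and `v_reset`, and the scalar "frame scores" are all hypothetical and chosen only for illustration.

```python
def lif_spikes(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Run one leaky integrate-and-fire neuron over a list of input values.

    Hypothetical illustration only; the paper's extractor is not specified here.
    Returns a list of 0/1 spikes, one per input.
    """
    v = v_reset
    spikes = []
    for x in inputs:
        # Leaky integration: the membrane potential drifts toward the input.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)   # the neuron fires ...
            v = v_reset        # ... and its potential is hard-reset
        else:
            spikes.append(0)
    return spikes


# Assumed per-frame saliency scores (made-up numbers, not real data).
frame_scores = [0.2, 0.3, 2.5, 0.1, 0.1, 3.0, 0.2]
spikes = lif_spikes(frame_scores)

# Frames where the neuron fired are treated as keyframe candidates.
keyframes = [i for i, s in enumerate(spikes) if s == 1]
print(keyframes)
```

The neuron emits output only when accumulated input crosses the threshold, so most frames produce no activity. That sparsity is the property usually cited when SNNs are described as computationally cheap.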
Wenrui Li
Assistant Professor, University of Connecticut
Statistics, Network science, Biostatistics
Wei Han
Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China, and also with Harbin Institute of Technology Suzhou Research Institute, Suzhou 215104, China
Liang-Jian Deng
School of Mathematical Sciences/Multi-Hazard Early Warning Key Laboratory of Sichuan Province, University of Electronic Science and Technology of China, Chengdu, Sichuan, 611731, China
Ruiqin Xiong
Peking University
video coding, image and video processing
Xiaopeng Fan
Professor, Harbin Institute of Technology
Video/Image, Wireless