Spiking Variational Graph Representation Inference for Video Summarization

📅 2025-08-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video summarization methods struggle to model long-range temporal dependencies, maintain semantic coherence, and fuse multi-channel features without being affected by noise. To address these challenges, the authors propose the Spiking Variational Graph (SpiVG) Network, which uses spiking neural networks (SNNs) for keyframe extraction. SpiVG adds a Dynamic Aggregation Graph Reasoner that decouples contextual object consistency from semantic perspective coherence, and a Variational Inference Reconstruction Module that optimizes the evidence lower bound (ELBO) with posterior distribution regularization to handle fusion noise and reduce overfitting. On four benchmarks (SumMe, TVSum, VideoXum, and QFVS), SpiVG is reported to surpass existing methods while increasing information density and reducing computational complexity.

📝 Abstract

With the rise of short video content, efficient video summarization techniques for extracting key information have become crucial. However, existing methods struggle to capture global temporal dependencies and to maintain the semantic coherence of video content. They are also affected by noise during multi-channel feature fusion. We propose a Spiking Variational Graph (SpiVG) Network, which enhances information density and reduces computational complexity. First, we design a keyframe extractor based on Spiking Neural Networks (SNNs), leveraging the event-driven computation mechanism of SNNs to learn keyframe features autonomously. To enable fine-grained and adaptable reasoning across video frames, we introduce a Dynamic Aggregation Graph Reasoner, which decouples contextual object consistency from semantic perspective coherence. We present a Variational Inference Reconstruction Module to address the uncertainty and noise that arise during multi-channel feature fusion. In this module, we optimize the Evidence Lower Bound (ELBO) to capture the latent structure of multi-channel feature distributions, and use posterior distribution regularization to reduce overfitting. Experimental results show that SpiVG surpasses existing methods across multiple datasets, including SumMe, TVSum, VideoXum, and QFVS. Our code and pre-trained models are available at https://github.com/liwrui/SpiVG.
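
For readers unfamiliar with the term, the ELBO mentioned in the abstract has a standard textbook form, shown below. This is the generic variational bound, not the paper's exact objective, which this page does not give. The symbols are assumptions chosen for illustration: x stands for the fused multi-channel features, z for a latent variable, q_φ for an approximate posterior, and p_θ for a reconstruction model.

```latex
\log p_\theta(x) \;\ge\;
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction term}}
\;-\;
\underbrace{\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{posterior regularization}}
```

Maximizing the right-hand side rewards accurate reconstruction while the KL term keeps the posterior close to the prior p(z), which is the usual sense in which posterior regularization can limit overfitting.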

Problem

Research questions and friction points this paper is trying to address.

Capturing global temporal dependencies in video summarization

Maintaining semantic coherence across video frames

Reducing noise during multi-channel feature fusion

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spiking Neural Networks for keyframe extraction

Dynamic Aggregation Graph for contextual reasoning

Variational Inference with ELBO optimization
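
To make the event-driven idea behind the first item concrete, here is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron that fires on frames whose accumulated score crosses a threshold. It is an illustration only, not the SpiVG keyframe extractor: the function names, the `threshold` and `decay` values, and the per-frame `frame_scores` are all hypothetical.

```python
def lif_spike_train(inputs, threshold=1.0, decay=0.9):
    """Simulate one leaky integrate-and-fire neuron over a sequence of inputs.

    Hypothetical helper for illustration; not taken from the SpiVG code.
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        # Leak the stored potential, then integrate the new input.
        membrane = decay * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after a spike
        else:
            spikes.append(0)
    return spikes


def select_keyframes(frame_scores, threshold=1.0, decay=0.9):
    """Return indices of frames at which the neuron spikes."""
    spikes = lif_spike_train(frame_scores, threshold, decay)
    return [i for i, s in enumerate(spikes) if s == 1]


# Assumed per-frame importance scores (made-up values).
frame_scores = [0.2, 0.3, 0.9, 0.1, 0.1, 1.2, 0.0]
print(select_keyframes(frame_scores))  # prints [2, 5]
```

Because the neuron only emits an output when its membrane potential crosses the threshold, most frames produce no activity, which is the general property that makes spiking models attractive for sparse, low-cost selection.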

🔎 Similar Papers

Unsupervised Video Summarization via Reinforcement Learning and a Trained Evaluator

2024-07-05 · arXiv.org · Citations: 1