Language-Guided Graph Representation Learning for Video Summarization

📅 2025-11-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video summarization methods struggle to capture long-range semantic dependencies and lack flexibility in adapting to multimodal user preferences; moreover, temporal proximity between frames often misaligns with semantic similarity. To address these issues, the authors propose a Language-guided Graph Representation Learning Network (LGRLN), which constructs forward, backward, and undirected temporal graphs to jointly model local temporal dynamics and global context. A dual-threshold graph convolution mechanism explicitly distinguishes strong from weak semantic associations between frames. Cross-modal embeddings, guided by natural language queries, enable text-driven, customizable summary generation. Summary generation is formulated as a mixture of Bernoulli distributions and optimized with the Expectation-Maximization (EM) algorithm. Extensive experiments show that LGRLN outperforms state-of-the-art methods across multiple benchmarks while reducing inference time by 87.8% and parameter count by 91.7%, delivering superior efficiency, personalization, and semantic consistency.
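The summary's mixture-of-Bernoulli formulation can be illustrated with a generic EM loop over binary frame-selection indicators. This is a minimal sketch of standard EM for a K-component Bernoulli mixture, not the paper's implementation; the component count, initialization, and iteration budget are all assumptions.

```python
import numpy as np

def em_bernoulli_mixture(y, K=2, n_iter=50, seed=0):
    """Fit a K-component mixture of Bernoulli distributions to binary
    frame-selection indicators y (shape (N,)) with vanilla EM.
    Illustrative sketch only; hyperparameters are assumptions."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)               # mixing weights
    p = rng.uniform(0.25, 0.75, size=K)    # per-component Bernoulli means
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] ∝ pi_k * p_k^y_i * (1 - p_k)^(1 - y_i)
        lik = p[None, :] ** y[:, None] * (1 - p[None, :]) ** (1 - y[:, None])
        r = pi[None, :] * lik
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and Bernoulli means
        Nk = r.sum(axis=0)
        pi = Nk / len(y)
        p = (r * y[:, None]).sum(axis=0) / (Nk + 1e-12)
    return pi, p
```

In the paper's setting the observations would be per-frame keyframe indicators; here the fit simply recovers mixing weights that sum to one and component means in [0, 1].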

📝 Abstract
With the rapid growth of video content on social media, video summarization has become a crucial task in multimedia processing. However, existing methods face challenges in capturing global dependencies in video content and accommodating multimodal user customization. Moreover, temporal proximity between video frames does not always correspond to semantic proximity. To tackle these challenges, we propose a novel Language-guided Graph Representation Learning Network (LGRLN) for video summarization. Specifically, we introduce a video graph generator that converts video frames into a structured graph; by constructing forward, backward, and undirected graphs, it effectively preserves the temporal order and contextual dependencies of video content. We design an intra-graph relational reasoning module with a dual-threshold graph convolution mechanism that distinguishes semantically relevant frames from irrelevant ones between nodes. Additionally, our proposed language-guided cross-modal embedding module generates video summaries conditioned on specific textual descriptions. We model the summary generation output as a mixture of Bernoulli distributions and solve it with the EM algorithm. Experimental results show that our method outperforms existing approaches across multiple benchmarks. Moreover, our proposed LGRLN reduces inference time and model parameters by 87.8% and 91.7%, respectively. Our code and pre-trained models are available at https://github.com/liwrui/LGRLN.
Problem

Research questions and friction points this paper is trying to address.

Capturing global dependencies in video content
Accommodating multimodal user customization requirements
Addressing semantic-temporal proximity mismatch in frames
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-guided graph network for video summarization
Dual-threshold graph convolution for semantic reasoning
Cross-modal embedding with textual description guidance
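The dual-threshold graph convolution listed above can be sketched as follows. This is a hypothetical reading of the mechanism, assuming cosine similarity between frame features and two illustrative thresholds that split edges into strong, weak (down-weighted), and pruned; the paper's actual thresholds, weighting, and learnable parameters are not reproduced here.

```python
import numpy as np

def dual_threshold_gcn_layer(X, tau_low=0.3, tau_high=0.7, w_weak=0.5):
    """Sketch of a dual-threshold graph convolution over frame features.

    X: (N, D) frame feature matrix. Frame pairs with cosine similarity
    >= tau_high form strong edges (weight 1.0), pairs in
    [tau_low, tau_high) form weak edges (weight w_weak), and pairs
    below tau_low are pruned. All values are illustrative assumptions.
    """
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    S = Xn @ Xn.T                                  # pairwise cosine similarity
    A = np.where(S >= tau_high, 1.0,
                 np.where(S >= tau_low, w_weak, 0.0))
    np.fill_diagonal(A, 1.0)                       # self-loops
    A_hat = A / A.sum(axis=1, keepdims=True)       # row-normalize adjacency
    return A_hat @ X                               # aggregate neighbor features
```

Separating strong from weak links this way lets semantically close but temporally distant frames exchange information while noisy low-similarity pairs are cut entirely.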
👥 Authors
Wenrui Li
Assistant Professor, University of Connecticut
Statistics · Network science · Biostatistics
Wei Han
Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China, and also with Harbin Institute of Technology Suzhou Research Institute, Suzhou 215104, China
Hengyu Man
Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China, and also with Harbin Institute of Technology Suzhou Research Institute, Suzhou 215104, China
Wangmeng Zuo
School of Computer Science and Technology, Harbin Institute of Technology
Computer Vision · Image Processing · Generative AI · Deep Learning · Biometrics
Xiaopeng Fan
Professor, Harbin Institute of Technology
Video/Image · Wireless
Yonghong Tian
School of AI for Science, the Shenzhen Graduate School, Peking University, Shenzhen, China, the Peng Cheng Laboratory, Shenzhen, China and also with the School of Computer Science, Peking University, Beijing, China