VisAug: Facilitating Speech-Rich Web Video Navigation and Engagement with Auto-Generated Visual Augmentations

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
For speech-rich videos—such as online lectures and meeting recordings—sparse visual content severely limits browsability and interactivity. To address this, we propose a semantics-driven, end-to-end visual-augmentation framework that jointly leverages automatic speech recognition (ASR), natural language processing (NLP), and dynamic visualization generation. The framework automatically maps spoken content to semantically aligned visual augmentations—including keyword clouds, timeline-based summary graphs, and key-segment highlights—and integrates them into an interactive navigation interface. Unlike existing purely vision-based summarization methods, ours is the first to systematically establish a semantic–visual co-enhancement paradigm designed specifically for speech-dominated videos. Experimental evaluation shows significant improvements in user comprehension accuracy (+28.6%) and interaction efficiency (a 37.2% reduction in task completion time), demonstrating practical utility in educational and remote-collaboration settings.
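The summary describes an ASR-to-visualization pipeline whose NLP stage maps transcript text to augmentations such as keyword clouds. As a rough illustration only—the paper does not publish its implementation, and the function names, tokenizer, and stop-word list below are assumptions—the simplest form of that stage could be sketched as frequency-based keyword extraction over an ASR transcript:

```python
import re
from collections import Counter

# Minimal stop-word list; a real system would use a full NLP stop list
# (this set is an assumption for illustration, not from the paper).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
              "we", "this", "that", "it", "for", "on", "with", "as"}

def extract_keywords(transcript: str, top_k: int = 5) -> list[tuple[str, int]]:
    """Return the top_k most frequent content words in an ASR transcript,
    as (word, count) pairs suitable for sizing a keyword cloud."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(t for t in tokens
                     if t not in STOP_WORDS and len(t) > 2)
    return counts.most_common(top_k)

# Example: a short lecture-transcript snippet.
transcript = ("today we discuss neural networks and how neural "
              "networks learn representations from data")
print(extract_keywords(transcript, top_k=2))  # → [('neural', 2), ('networks', 2)]
```

VisAug's actual pipeline is far richer (timeline summaries, key-segment highlights), but each augmentation ultimately starts from transcript-level signals like these counts.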

📝 Abstract
The widespread adoption of digital technology has ushered in a new era of digital transformation across all aspects of our lives. Online learning, social, and work activities, such as distance education, videoconferencing, interviews, and talks, have led to a dramatic increase in speech-rich video content. In contrast to other video types, such as surveillance footage, which typically contain abundant visual cues, speech-rich videos convey most of their meaningful information through the audio channel. This poses challenges for improving content consumption using existing visual-based video summarization, navigation, and exploration systems. In this paper, we present VisAug, a novel interactive system designed to enhance speech-rich video navigation and engagement by automatically generating informative and expressive visual augmentations based on the speech content of videos. Our findings suggest that this system has the potential to significantly enhance the consumption and engagement of information in an increasingly video-driven digital landscape.
Problem

Research questions and friction points this paper is trying to address.

Enhancing navigation in speech-rich videos lacking visual cues
Improving engagement with auto-generated visual augmentations
Addressing limitations of visual-based video summarization systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Auto-generates visual augmentations from speech
Enhances navigation for speech-rich videos
Interactive system for video engagement
Baoquan Zhao
Sun Yat-sen University
3D point cloud processing and compression · Multimedia content analysis · Open Educational Resources
Xiaofan Ma
Sun Yat-sen University
Human-Computer Interaction
Qianshi Pang
Sun Yat-sen University, Zhuhai, China
Ruomei Wang
Sun Yat-sen University, Zhuhai, China
Fan Zhou
Sun Yat-sen University, Guangzhou, China
Shujin Lin
Sun Yat-sen University, Guangzhou, China