🤖 AI Summary
Speech-rich videos, such as online lectures and meeting recordings, carry little visual content, which severely limits how easily they can be browsed and interacted with. To address this, we propose a semantics-driven, end-to-end visualization enhancement framework that combines automatic speech recognition (ASR), natural language processing (NLP), and dynamic visualization generation. The framework automatically maps spoken content to semantically aligned visual enhancements (keyword clouds, timeline-based summary graphs, and key-segment highlights) and integrates them into an interactive navigation interface. Unlike existing purely vision-based summarization methods, ours is the first to systematically establish a semantic–visual co-enhancement paradigm designed specifically for speech-dominated videos. Experimental evaluation shows improvements in user comprehension accuracy (+28.6%) and interaction efficiency (a 37.2% reduction in task completion time), indicating practical utility in educational and remote collaboration settings.
📝 Abstract
The widespread adoption of digital technology has ushered in a new era of digital transformation across all aspects of our lives. Online activities for learning, socializing, and work, such as distance education, videoconferencing, interviews, and talks, have led to a dramatic increase in speech-rich video content. In contrast to other video types, such as surveillance footage, which typically contain abundant visual cues, speech-rich videos convey most of their meaningful information through the audio channel. This makes it difficult to improve content consumption with existing vision-based video summarization, navigation, and exploration systems. In this paper, we present VisAug, a novel interactive system designed to enhance speech-rich video navigation and engagement by automatically generating informative and expressive visual augmentations based on the speech content of videos. Our findings suggest that this system has the potential to significantly improve how people consume and engage with information in an increasingly video-driven digital landscape.
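To make the idea of deriving visual augmentations from speech content more concrete, here is a minimal Python sketch of one possible step: turning a timestamped transcript into keyword weights (the raw material for a keyword cloud) and picking the segments that mention those keywords most often (candidate highlights). This is not the VisAug implementation. The segment format `(start, end, text)`, the tiny stopword list, the frequency-based scoring, and the function names `tokenize`, `keyword_weights`, and `highlight_segments` are all assumptions made for illustration; a real system would take its segments from an ASR engine and would likely use stronger NLP than word counts.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "and", "we", "a", "an", "of", "to", "in", "is", "it",
             "any", "before", "using", "today", "this", "that", "for"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS and len(w) > 2]


def keyword_weights(segments, top_k=5):
    """Count word frequency over all transcript segments.

    `segments` is a list of (start_seconds, end_seconds, text) tuples,
    the assumed shape of the ASR output. Returns the top_k (word, count)
    pairs, which could drive the font sizes in a keyword cloud.
    """
    counts = Counter()
    for _start, _end, text in segments:
        counts.update(tokenize(text))
    return counts.most_common(top_k)


def highlight_segments(segments, keywords, top_n=1):
    """Return the (start, end) spans that mention the keywords most often."""
    wanted = set(keywords)
    scored = []
    for start, end, text in segments:
        score = sum(1 for w in tokenize(text) if w in wanted)
        scored.append((score, start, end))
    # Highest score first; earlier segments win ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(start, end) for score, start, end in scored[:top_n] if score > 0]
```

Given a list of transcript segments, `keyword_weights(segments)` yields the weighted terms, and `highlight_segments(segments, [w for w, _ in weights])` yields time spans that a navigation interface could mark on the timeline.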