AI Summary
To address the challenges of fine-grained information localization and retrieval of visual-domain-specific content (e.g., abbreviations) in online lecture videos, this paper proposes a multi-agent multimodal joint indexing framework. Methodologically, it introduces three novel components: (1) a vision-language model (VLM)-driven speech correction module; (2) a prior-knowledge-enhanced visual understanding module; and (3) a critic-agent-guided iterative visual self-reflection mechanism. The framework integrates VLMs, automatic speech recognition (ASR), multimodal alignment, and multi-agent coordination to achieve semantic video segmentation, cross-modal content extraction, and joint semantic indexing. Evaluated on the LPM benchmark and proprietary enterprise datasets, our approach significantly outperforms existing baselines. Notably, it is the first to enable precise retrieval of terms and abbreviations appearing exclusively in slides, thereby substantially improving indexing granularity and reliability.
Abstract
In recent years, online lecture videos have become an increasingly popular resource for acquiring new knowledge. Systems capable of effectively understanding/indexing lecture videos are thus highly desirable, enabling downstream tasks like question answering to help users efficiently locate specific information within videos. This work proposes PreMind, a novel multi-agent multimodal framework that leverages various large models for advanced understanding/indexing of presentation-style videos. PreMind first segments videos into slide-presentation segments using a Vision-Language Model (VLM) to enhance modern shot-detection techniques. Each segment is then analyzed to generate multimodal indexes through three key steps: (1) extracting slide visual content, (2) transcribing speech narratives, and (3) consolidating these visual and speech contents into an integrated understanding. Three innovative mechanisms are also proposed to improve performance: leveraging prior lecture knowledge to refine visual understanding, detecting/correcting speech transcription errors using a VLM, and utilizing a critic agent for dynamic iterative self-reflection in vision analysis. Compared to traditional video indexing methods, PreMind captures rich, reliable multimodal information, allowing users to search for details like abbreviations shown only on slides. Systematic evaluations on the public LPM dataset and an internal enterprise dataset are conducted to validate PreMind's effectiveness, supported by detailed analyses.
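The per-segment pipeline described above (extract slide text, transcribe speech, correct the transcript against the slide, then consolidate into a joint index) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all names are hypothetical, and the VLM-guided correction step is replaced by a toy heuristic that restores slide abbreviations mis-transcribed in lowercase by ASR.

```python
# Hypothetical sketch of PreMind-style per-segment indexing.
# All classes/functions are illustrative; the real system uses a VLM,
# ASR, and a critic agent rather than these toy heuristics.
from dataclasses import dataclass


@dataclass
class Segment:
    slide_text: str   # step 1: visual content extracted from the slide
    transcript: str   # step 2: raw ASR speech narrative


def correct_transcript(slide_text: str, transcript: str) -> str:
    """Toy stand-in for VLM-driven ASR correction: restore abbreviations
    that appear verbatim on the slide but were lowercased by ASR."""
    corrected = transcript
    for token in slide_text.split():
        if token.isupper() and token.lower() in corrected:
            corrected = corrected.replace(token.lower(), token)
    return corrected


def consolidate(seg: Segment) -> dict:
    """Step 3: merge visual and (corrected) speech content into one
    searchable index entry, keeping slide-only terms retrievable."""
    speech = correct_transcript(seg.slide_text, seg.transcript)
    terms = sorted({t for t in seg.slide_text.split() if t.isupper()})
    return {"summary": f"{seg.slide_text} | {speech}", "terms": terms}


seg = Segment(slide_text="PreMind uses a VLM and ASR",
              transcript="premind uses a vlm and asr")
index = consolidate(seg)
# index["terms"] now lists the slide abbreviations ["ASR", "VLM"]
```

The point of the sketch is the data flow, not the heuristics: abbreviations visible only on the slide end up in the index entry, which is what makes queries like "find where VLM is defined" answerable.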