🤖 AI Summary
To address efficient large-scale video retrieval under strict latency constraints, this paper proposes Fusionista2.0, a lightweight end-to-end video content retrieval framework. Methodologically: (1) it restructures the core retrieval pipeline by integrating a lightweight vision-language model (VLM) for cross-modal semantic alignment; (2) it combines optimized multilingual OCR (Vintern-1B-v3.5) and real-time speech recognition (faster-whisper) with efficient FFmpeg-based keyframe extraction; and (3) it provides a redesigned interactive interface that improves user efficiency. Experiments show up to a 75% reduction in retrieval latency, a 12.3% improvement in mean Average Precision (mAP), and significantly higher user satisfaction. To our knowledge, this is the first work to tightly couple a lightweight VLM with real-time multimodal parsing, balancing accuracy, speed, and usability in a way that is practical to deploy for real-world large-scale video search.
📝 Abstract
The Video Browser Showdown (VBS) challenges systems to deliver accurate results under strict time constraints. To meet this demand, we present Fusionista2.0, a streamlined video retrieval system optimized for speed and usability. All core modules were re-engineered for efficiency: preprocessing now relies on ffmpeg for fast keyframe extraction, optical character recognition uses Vintern-1B-v3.5 for robust multilingual text recognition, and automatic speech recognition employs faster-whisper for real-time transcription. For question answering, lightweight vision-language models provide quick responses without the heavy cost of large models. Beyond these technical upgrades, Fusionista2.0 introduces a redesigned user interface with improved responsiveness, accessibility, and workflow efficiency, enabling even non-expert users to retrieve relevant content rapidly. Evaluations demonstrate that retrieval time was reduced by up to 75% while accuracy and user satisfaction both increased, confirming Fusionista2.0 as a competitive and user-friendly system for large-scale video search.
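The abstract does not spell out how the ffmpeg-based keyframe extraction is invoked. As a minimal sketch of one common approach, and not necessarily the authors' configuration, the Python helper below builds an ffmpeg command that decodes only keyframes and writes each one as a JPEG. The function names, the output filename pattern, and the JPEG quality setting are assumptions made for illustration. The `-fps_mode` option requires a reasonably recent ffmpeg build; older builds use `-vsync vfr` instead.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_keyframe_command(video: str, out_dir: str,
                           ffmpeg_bin: str = "ffmpeg") -> List[str]:
    """Build an ffmpeg argument list that writes one JPEG per keyframe.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Assumed output naming scheme: kf_000001.jpg, kf_000002.jpg, ...
    pattern = str(Path(out_dir) / "kf_%06d.jpg")
    return [
        ffmpeg_bin,
        "-hide_banner",
        "-skip_frame", "nokey",  # input option: decoder skips non-keyframes
        "-i", video,
        "-fps_mode", "vfr",      # do not duplicate frames to fill timestamp gaps
        "-q:v", "2",             # assumed JPEG quality (lower value = higher quality)
        pattern,
    ]


def extract_keyframes(video: str, out_dir: str) -> None:
    """Run the command above; raises if ffmpeg is missing or exits non-zero."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_keyframe_command(video, out_dir), check=True)
```

Skipping non-keyframes at the decoder avoids decoding most of the stream, which is why this style of extraction tends to be fast. Whether keyframes alone give adequate coverage for retrieval depends on how the videos were encoded.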