Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets

📅 2025-11-15

🤖 AI Summary

To address efficient large-scale video retrieval under strict latency constraints, this paper proposes a lightweight end-to-end video content retrieval framework. Methodologically, it (1) restructures the core retrieval pipeline around a lightweight vision-language model (VLM) for cross-modal semantic alignment; (2) combines multilingual OCR (Vintern-1B-v3.5) and real-time speech recognition (faster-whisper) with efficient FFmpeg-based keyframe extraction; and (3) provides a redesigned interactive interface that improves user efficiency. Experiments report up to a 75% reduction in retrieval latency, a 12.3% improvement in mean Average Precision (mAP), and higher user satisfaction. To our knowledge, this is the first work to tightly couple a lightweight VLM with real-time multimodal parsing, balancing accuracy, speed, and usability in a way that is practical to deploy for real-world large-scale video search.

📝 Abstract

The Video Browser Showdown (VBS) challenges systems to deliver accurate results under strict time constraints. To meet this demand, we present Fusionista2.0, a streamlined video retrieval system optimized for speed and usability. All core modules were re-engineered for efficiency: preprocessing now relies on ffmpeg for fast keyframe extraction, optical character recognition uses Vintern-1B-v3.5 for robust multilingual text recognition, and automatic speech recognition employs faster-whisper for real-time transcription. For question answering, lightweight vision-language models provide quick responses without the heavy cost of large models. Beyond these technical upgrades, Fusionista2.0 introduces a redesigned user interface with improved responsiveness, accessibility, and workflow efficiency, enabling even non-expert users to retrieve relevant content rapidly. Evaluations demonstrate that retrieval time was reduced by up to 75% while accuracy and user satisfaction both increased, confirming Fusionista2.0 as a competitive and user-friendly system for large-scale video search.
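
To make the speech-recognition step concrete, here is a minimal Python sketch of how faster-whisper output could be turned into time-stamped records for a search index. It is an illustration under assumptions, not the authors' implementation: `TranscriptRecord`, `segments_to_records`, the `"small"` model size, and the CPU/int8 settings are hypothetical choices that the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass
class TranscriptRecord:
    """One searchable transcript snippet (hypothetical record layout)."""
    video_id: str
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str


def segments_to_records(video_id, segments):
    """Convert ASR segments (objects with .start, .end, .text) into records.

    Segments whose text is empty after stripping whitespace are dropped.
    """
    return [
        TranscriptRecord(video_id, round(s.start, 2), round(s.end, 2), s.text.strip())
        for s in segments
        if s.text.strip()
    ]


def transcribe(video_id, audio_path, model_size="small"):
    """Transcribe one audio file with faster-whisper and return records."""
    # Imported lazily so segments_to_records works without the package installed.
    from faster_whisper import WhisperModel

    # Assumed settings: CPU inference with int8 weights to keep the model light.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, beam_size=5)
    return segments_to_records(video_id, segments)
```

Each record keeps its start and end time, so a text hit in the transcript can be mapped back to a position in the video.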

Problem

Research questions and friction points this paper is trying to address.

Optimizing video retrieval speed for large datasets under time constraints

Enhancing multilingual text and speech recognition efficiency in video processing

Improving user interface responsiveness for non-expert video content retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fast keyframe extraction using ffmpeg for preprocessing

Lightweight vision-language models for quick question answering

Redesigned user interface for improved responsiveness and accessibility
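
As an illustration of the ffmpeg-based keyframe extraction listed above, the following Python sketch builds and runs an ffmpeg command that keeps only intra-coded (I) frames. The paper does not state its exact command line, so the filter, the output naming pattern, and the helper names `build_keyframe_command` and `extract_keyframes` are assumptions made for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_keyframe_command(video, out_dir, quality=2):
    """Build an ffmpeg command that writes one JPEG per I-frame."""
    pattern = str(Path(out_dir) / "kf_%06d.jpg")
    return [
        "ffmpeg",
        "-hide_banner",
        "-i", video,
        # Keep only frames whose picture type is I (keyframes).
        "-vf", "select='eq(pict_type,I)'",
        # Variable frame rate output, so dropped frames are not duplicated.
        # Newer ffmpeg releases spell this option "-fps_mode vfr".
        "-vsync", "vfr",
        # JPEG quality scale: lower values mean higher quality.
        "-q:v", str(quality),
        pattern,
    ]


def extract_keyframes(video, out_dir):
    """Run ffmpeg and return the extracted keyframe paths in order."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_keyframe_command(video, out_dir), check=True)
    return sorted(Path(out_dir).glob("kf_*.jpg"))
```

Calling `extract_keyframes("clip.mp4", "frames")` would write files such as `frames/kf_000001.jpg`. Selecting I-frames avoids writing every frame to disk, which is one plausible reason this kind of preprocessing is fast.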
