Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets

📅 2025-11-15
🤖 AI Summary
To address the challenge of efficient large-scale video retrieval under strict latency constraints, this paper proposes a lightweight end-to-end video content retrieval framework. The method has three parts: (1) the core retrieval pipeline is restructured around a lightweight vision-language model (VLM) for cross-modal semantic alignment; (2) multilingual OCR (Vintern-1B-v3.5) and real-time speech recognition (faster-whisper) are combined with efficient FFmpeg-based keyframe extraction; and (3) an intuitive interactive interface improves how quickly users can operate the system. The summary reports up to a 75% reduction in retrieval latency, a 12.3% improvement in mean Average Precision (mAP), and higher user satisfaction. The work is described as the first to tightly couple a lightweight VLM with real-time multimodal parsing, balancing accuracy, speed, and usability in a way that is practical to deploy for real-world large-scale video search.
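For readers unfamiliar with the accuracy metric named above, the following is a minimal sketch of how mean Average Precision is conventionally computed over ranked retrieval results. It is not taken from the paper: the function names and the toy item identifiers are assumptions made for illustration, and each ranked list is assumed to contain no duplicate identifiers.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """AP for one query: precision@k summed over the ranks k that hold a
    relevant item, divided by the total number of relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / k  # precision at rank k
    # Dividing by len(relevant) penalises relevant items that were never retrieved.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """mAP: the unweighted mean of the per-query AP values."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)
```

A relative improvement in mAP, such as the figure quoted in the summary, would compare two values produced this way on the same queries.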

📝 Abstract
The Video Browser Showdown (VBS) challenges systems to deliver accurate results under strict time constraints. To meet this demand, we present Fusionista2.0, a streamlined video retrieval system optimized for speed and usability. All core modules were re-engineered for efficiency: preprocessing now relies on ffmpeg for fast keyframe extraction, optical character recognition uses Vintern-1B-v3.5 for robust multilingual text recognition, and automatic speech recognition employs faster-whisper for real-time transcription. For question answering, lightweight vision-language models provide quick responses without the heavy cost of large models. Beyond these technical upgrades, Fusionista2.0 introduces a redesigned user interface with improved responsiveness, accessibility, and workflow efficiency, enabling even non-expert users to retrieve relevant content rapidly. Evaluations demonstrate that retrieval time was reduced by up to 75% while accuracy and user satisfaction both increased, confirming Fusionista2.0 as a competitive and user-friendly system for large-scale video search.
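The speech-recognition stage mentioned in the abstract could look roughly like the sketch below, which uses the public faster-whisper API (`WhisperModel` and its `transcribe` method). The model size, the CPU/int8 settings, and the `TranscriptEntry` record are assumptions chosen for illustration; the abstract does not say which configuration or data layout Fusionista2.0 uses.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TranscriptEntry:
    """Hypothetical index record: one timestamped span of recognised speech."""
    start: float
    end: float
    text: str


def segments_to_entries(segments: Iterable) -> List[TranscriptEntry]:
    """Convert objects exposing .start, .end and .text into index records,
    dropping segments whose text is empty after stripping whitespace."""
    entries = []
    for seg in segments:
        text = seg.text.strip()
        if text:
            entries.append(TranscriptEntry(round(seg.start, 2), round(seg.end, 2), text))
    return entries


def transcribe(path: str, model_size: str = "small") -> List[TranscriptEntry]:
    """Transcribe an audio or video file. Model size and compute type are
    illustrative assumptions, not settings reported by the authors."""
    # Imported lazily so the helper above works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus an info object.
    segments, _info = model.transcribe(path, beam_size=5)
    return segments_to_entries(segments)
```

Keeping the timestamps alongside the text lets a spoken-word match be mapped back to a position in the video, which is what a retrieval interface needs in order to jump to the right moment.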
Problem

Research questions and friction points this paper is trying to address.

Optimizing video retrieval speed for large datasets under time constraints
Enhancing multilingual text and speech recognition efficiency in video processing
Improving user interface responsiveness for non-expert video content retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fast keyframe extraction using ffmpeg for preprocessing
Lightweight vision-language models for quick question answering
Redesigned user interface for improved responsiveness and accessibility
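As an illustration of the first item above, here is a minimal sketch of keyframe extraction driven by the ffmpeg command line. The exact flags used by the authors are not given on this page, so the command is an assumption: it asks the decoder to skip everything except keyframes (`-skip_frame nokey`) and writes each one as a JPEG. The function names and the `kf_%06d.jpg` naming pattern are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def keyframe_command(video: str, out_dir: str, quality: int = 2) -> List[str]:
    """Build an ffmpeg argument list that writes one JPEG per keyframe."""
    pattern = str(Path(out_dir) / "kf_%06d.jpg")
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-skip_frame", "nokey",  # input option: decode keyframes only
        "-i", video,
        "-vsync", "vfr",         # no frame duplication; newer ffmpeg prefers -fps_mode vfr
        "-q:v", str(quality),    # JPEG quality scale, lower is better
        pattern,
    ]


def extract_keyframes(video: str, out_dir: str) -> List[Path]:
    """Run ffmpeg (must be on PATH) and return the written image paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(keyframe_command(video, out_dir), check=True)
    return sorted(Path(out_dir).glob("kf_*.jpg"))
```

Skipping non-keyframes at decode time avoids decoding most of the stream, which is one plausible reason an ffmpeg-based preprocessing step can be fast on large collections.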
Authors

Huy M. Le
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE

Dat Tien Nguyen
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE

Phuc Binh Nguyen
University of Information Technology, Ho Chi Minh City, Vietnam

Gia-Bao Le-Tran
University of Information Technology, Ho Chi Minh City, Vietnam

Phu Truong Thien
University of Information Technology, Ho Chi Minh City, Vietnam

Cuong Dinh
University of Information Technology, Ho Chi Minh City, Vietnam

Minh Nguyen
University of Information Technology, Ho Chi Minh City, Vietnam

Nga Nguyen
University of Information Technology, Ho Chi Minh City, Vietnam

Thuy T. N. Nguyen
University of Information Technology, Ho Chi Minh City, Vietnam

Huy Gia Ngo
University of Information Technology, Ho Chi Minh City, Vietnam

Tan Nhat Nguyen
University of Information Technology, Ho Chi Minh City, Vietnam

Binh T. Nguyen
VinUniversity
statistics, optimal transport

Monojit Choudhury
Professor of Natural Language Processing, MBZUAI
Natural Language Processing, Large Language Models, Ethics of AI, Computational Social Science