AutoArabic: A Three-Stage Framework for Localizing Video-Text Retrieval Benchmarks

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Arabic video–text retrieval has long suffered from the absence of localized evaluation protocols and benchmark datasets. To address this, we propose the first Arabic-specific video–text retrieval benchmark localization framework, built on a three-stage large language model (LLM)-driven pipeline: (1) automatic translation of English benchmarks (e.g., DiDeMo), (2) construction of a fine-grained taxonomy of translation errors, and (3) a dual-mode error detection mechanism combining rule-based heuristics and LLM inference. The resulting DiDeMo-AR dataset comprises 40,144 high-quality Arabic textual annotations, with error detection achieving 97% accuracy. Even under a zero post-editing budget (i.e., using raw LLM translations without manual correction), CLIP-style cross-lingual models attain Arabic retrieval performance within about 3 percentage points of their English counterparts, demonstrating both the effectiveness and the scalability of the framework.
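The dual-mode error detection described above (rule-based heuristics plus LLM inference) could be sketched roughly as follows. All function names and specific heuristics here are illustrative assumptions, not the paper's released implementation; `llm_check` stands in for an LLM-based judgment.

```python
import re

# Illustrative heuristic patterns (assumed, not the paper's actual rules).
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")   # any Arabic-block character
LATIN_RE = re.compile(r"[A-Za-z]{3,}")       # untranslated Latin-script token

def rule_based_flags(source_en: str, candidate_ar: str) -> list[str]:
    """Return heuristic error flags for an (English, Arabic) translation pair."""
    flags = []
    if not candidate_ar.strip():
        flags.append("empty-output")
    if not ARABIC_RE.search(candidate_ar):
        flags.append("no-arabic-script")
    if LATIN_RE.search(candidate_ar):
        flags.append("untranslated-latin-token")
    # Translations far shorter or longer than the source are suspicious.
    ratio = len(candidate_ar) / max(len(source_en), 1)
    if ratio < 0.3 or ratio > 3.0:
        flags.append("length-ratio-anomaly")
    return flags

def detect_errors(source_en, candidate_ar, llm_check=None):
    """Dual-mode detection: cheap rules first, optional LLM verdict second."""
    flags = rule_based_flags(source_en, candidate_ar)
    if not flags and llm_check is not None:
        # llm_check(source, candidate) -> bool: True means the pair looks correct.
        if not llm_check(source_en, candidate_ar):
            flags.append("llm-flagged")
    return flags
```

In this sketch, the LLM is only consulted when the cheap rules pass, which keeps the expensive inference calls to a minimum.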

📝 Abstract
Video-to-text and text-to-video retrieval are dominated by English benchmarks (e.g., DiDeMo, MSR-VTT) and recent multilingual corpora (e.g., RUDDER), yet Arabic remains underserved, lacking localized evaluation metrics. We introduce AutoArabic, a three-stage framework that uses state-of-the-art large language models (LLMs) to translate non-Arabic benchmarks into Modern Standard Arabic, reducing the manual revision required by nearly fourfold. The framework incorporates an error detection module that automatically flags potential translation errors with 97% accuracy. Applying the framework to DiDeMo, a video retrieval benchmark, produces DiDeMo-AR, an Arabic variant with 40,144 fluent Arabic descriptions. We analyze the translation errors and organize them into an insightful taxonomy to guide future Arabic localization efforts. We train a CLIP-style baseline with identical hyperparameters on the Arabic and English variants of the benchmark and find a moderate performance gap (about 3 percentage points at Recall@1), indicating that Arabic localization preserves benchmark difficulty. We evaluate three post-editing budgets (zero, flagged-only, and full) and find that performance improves monotonically with more post-editing, while the raw LLM output (zero budget) remains usable. To support reproducibility and extension to other languages, the code is available at https://github.com/Tahaalshatiri/AutoArabic.
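The Recall@1 gap reported in the abstract is measured with the standard retrieval recall metric, which can be computed from a query-by-candidate similarity matrix. The sketch below is a generic Recall@K implementation under the common assumption that the ground-truth match for query i is candidate i; it is not code from the AutoArabic repository.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int = 1) -> float:
    """Recall@K for a (num_queries x num_candidates) similarity matrix,
    assuming the ground-truth candidate for query i is index i."""
    # Rank candidates for each query by descending similarity.
    ranking = np.argsort(-sim, axis=1)
    top_k = ranking[:, :k]
    # A hit means the ground-truth index appears in the top-K ranked candidates.
    hits = (top_k == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 captions (queries) scored against 3 videos (candidates).
sim = np.array([
    [0.9, 0.1, 0.0],   # query 0 ranks its own video first -> hit at K=1
    [0.2, 0.1, 0.7],   # query 1 ranks video 2 first       -> miss at K=1
    [0.0, 0.8, 0.3],   # query 2 ranks video 1 first       -> miss at K=1
])
```

With this toy matrix, `recall_at_k(sim, 1)` is 1/3 and `recall_at_k(sim, 2)` is 2/3, so the "3 percentage points at Recall@1" gap corresponds directly to differences in this quantity between the Arabic and English variants.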
Problem

Research questions and friction points this paper is trying to address.

Arabic lacks localized evaluation metrics for video-text retrieval benchmarks
Manual translation of benchmarks into Arabic requires substantial human effort
Existing multilingual corpora inadequately serve the Arabic language community
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-stage framework for Arabic video-text localization
LLM-based translation with error detection module
CLIP-style baseline evaluation across editing budgets
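The three stages listed above could be wired together roughly as follows. This is an illustrative sketch, not the released code: `translate` and `detect_errors` are placeholders for the paper's LLM translation and dual-mode error-detection calls.

```python
def localize_benchmark(captions_en, translate, detect_errors):
    """Three-stage localization sketch (illustrative):
    (1) LLM translation, (2) taxonomy-guided error detection,
    (3) routing flagged items to post-editing."""
    records = []
    for cap in captions_en:
        ar = translate(cap)               # stage 1: translate into Arabic
        flags = detect_errors(cap, ar)    # stage 2: flag suspected errors
        records.append({
            "en": cap,
            "ar": ar,
            "flags": flags,
            "needs_review": bool(flags),  # stage 3: only flagged items are revised
        })
    return records
```

Routing only flagged items to human revision is what makes the "flagged-only" post-editing budget in the paper cheaper than full revision.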
Authors

Mohamed Eltahir
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Osamah Sarraj
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Abdulrahman Alfrihidi
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Taha Alshatiri
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Mohammed Khurd
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Mohammed Bremoo
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Tanveer Hussain
Lecturer, Department of Computer Science, Edge Hill University
Research areas: Computer Vision, Video Summarisation, Saliency Detection, Fire/Smoke Detection