AutoArabic: A Three-Stage Framework for Localizing Video-Text Retrieval Benchmarks

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Arabic video–text retrieval has long suffered from the absence of localized evaluation protocols and benchmark datasets. To address this, we propose the first Arabic-specific video–text retrieval benchmark localization framework, built on a three-stage large language model (LLM)-driven pipeline: (1) automatic translation of English benchmarks (e.g., DiDeMo), (2) construction of a fine-grained taxonomy of translation errors, and (3) a dual-mode error detection mechanism combining rule-based heuristics and LLM inference. The resulting DiDeMo-AR dataset comprises 40,144 high-quality Arabic textual annotations, with error detection achieving 97% accuracy. Even under a zero post-editing budget (i.e., using raw LLM translations without manual correction), CLIP-style cross-lingual models attain Arabic retrieval performance within about 3 percentage points of their English counterparts, demonstrating both the effectiveness and the scalability of the framework.
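The dual-mode error detection described above (rule-based heuristics plus LLM inference) could be sketched roughly as follows. All function names and specific heuristics here are illustrative assumptions, not the paper's released implementation; `llm_check` stands in for an LLM-based judgment.

```python
import re

# Illustrative heuristic patterns (assumed, not the paper's actual rules).
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")   # any Arabic-block character
LATIN_RE = re.compile(r"[A-Za-z]{3,}")       # untranslated Latin-script token

def rule_based_flags(source_en: str, candidate_ar: str) -> list[str]:
    """Return heuristic error flags for an (English, Arabic) translation pair."""
    flags = []
    if not candidate_ar.strip():
        flags.append("empty-output")
    if not ARABIC_RE.search(candidate_ar):
        flags.append("no-arabic-script")
    if LATIN_RE.search(candidate_ar):
        flags.append("untranslated-latin-token")
    # Translations far shorter or longer than the source are suspicious.
    ratio = len(candidate_ar) / max(len(source_en), 1)
    if ratio < 0.3 or ratio > 3.0:
        flags.append("length-ratio-anomaly")
    return flags

def detect_errors(source_en, candidate_ar, llm_check=None):
    """Dual-mode detection: cheap rules first, optional LLM verdict second."""
    flags = rule_based_flags(source_en, candidate_ar)
    if not flags and llm_check is not None:
        # llm_check(source, candidate) -> bool: True means the pair looks correct.
        if not llm_check(source_en, candidate_ar):
            flags.append("llm-flagged")
    return flags
```

In this sketch, the LLM is only consulted when the cheap rules pass, which keeps the expensive inference calls to a minimum.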

📝 Abstract
Video-to-text and text-to-video retrieval are dominated by English benchmarks (e.g., DiDeMo, MSR-VTT) and recent multilingual corpora (e.g., RUDDER), yet Arabic remains underserved, lacking localized evaluation metrics. We introduce AutoArabic, a three-stage framework that uses state-of-the-art large language models (LLMs) to translate non-Arabic benchmarks into Modern Standard Arabic, reducing the manual revision required by nearly fourfold. The framework incorporates an error detection module that automatically flags potential translation errors with 97% accuracy. Applying the framework to DiDeMo, a video retrieval benchmark, produces DiDeMo-AR, an Arabic variant with 40,144 fluent Arabic descriptions. We analyze the translation errors and organize them into an insightful taxonomy to guide future Arabic localization efforts. We train a CLIP-style baseline with identical hyperparameters on the Arabic and English variants of the benchmark and find a moderate performance gap (about 3 percentage points at Recall@1), indicating that Arabic localization preserves benchmark difficulty. We evaluate three post-editing budgets (zero, flagged-only, and full) and find that performance improves monotonically with more post-editing, while the raw LLM output (zero budget) remains usable. To support reproducibility and extension to other languages, the code is available at https://github.com/Tahaalshatiri/AutoArabic.
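The Recall@1 gap reported in the abstract is measured with the standard retrieval recall metric, which can be computed from a query-by-candidate similarity matrix. The sketch below is a generic Recall@K implementation under the common assumption that the ground-truth match for query i is candidate i; it is not code from the AutoArabic repository.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int = 1) -> float:
    """Recall@K for a (num_queries x num_candidates) similarity matrix,
    assuming the ground-truth candidate for query i is index i."""
    # Rank candidates for each query by descending similarity.
    ranking = np.argsort(-sim, axis=1)
    top_k = ranking[:, :k]
    # A hit means the ground-truth index appears in the top-K ranked candidates.
    hits = (top_k == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 captions (queries) scored against 3 videos (candidates).
sim = np.array([
    [0.9, 0.1, 0.0],   # query 0 ranks its own video first -> hit at K=1
    [0.2, 0.1, 0.7],   # query 1 ranks video 2 first       -> miss at K=1
    [0.0, 0.8, 0.3],   # query 2 ranks video 1 first       -> miss at K=1
])
```

With this toy matrix, `recall_at_k(sim, 1)` is 1/3 and `recall_at_k(sim, 2)` is 2/3, so the "3 percentage points at Recall@1" gap corresponds directly to differences in this quantity between the Arabic and English variants.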
Problem

Research questions and friction points this paper is trying to address.

Arabic lacks localized evaluation metrics for video-text retrieval benchmarks
Manual translation of benchmarks into Arabic requires substantial human effort
Existing multilingual corpora inadequately serve the Arabic language community
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-stage framework for Arabic video-text localization
LLM-based translation with error detection module
CLIP-style baseline evaluation across editing budgets
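The three stages listed above could be wired together roughly as follows. This is an illustrative sketch, not the released code: `translate` and `detect_errors` are placeholders for the paper's LLM translation and dual-mode error-detection calls.

```python
def localize_benchmark(captions_en, translate, detect_errors):
    """Three-stage localization sketch (illustrative):
    (1) LLM translation, (2) taxonomy-guided error detection,
    (3) routing flagged items to post-editing."""
    records = []
    for cap in captions_en:
        ar = translate(cap)               # stage 1: translate into Arabic
        flags = detect_errors(cap, ar)    # stage 2: flag suspected errors
        records.append({
            "en": cap,
            "ar": ar,
            "flags": flags,
            "needs_review": bool(flags),  # stage 3: only flagged items are revised
        })
    return records
```

Routing only flagged items to human revision is what makes the "flagged-only" post-editing budget in the paper cheaper than full revision.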
Authors

Mohamed Eltahir
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Osamah Sarraj
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Abdulrahman Alfrihidi
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Taha Alshatiri
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Mohammed Khurd
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Mohammed Bremoo
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Tanveer Hussain
Lecturer, Department of Computer Science, Edge Hill University
Research areas: Computer Vision, Video Summarisation, Saliency Detection, Fire/Smoke Detection