🤖 AI Summary
Problem: ViDoRe V1 is approaching saturation, with top models exceeding 90% nDCG@5, which limits its ability to discriminate between models.
Method: We introduce ViDoRe V2, a multilingual visual document retrieval benchmark that addresses V1’s limitations through blind contextual querying, long and cross-document queries, hybrid synthetic and human-in-the-loop query generation, and four diverse multilingual datasets. V2 integrates synthetic data generation, expert human validation, multilingual alignment-aware evaluation, and a dynamic nDCG@5 testing framework.
Contribution/Results: Initial experiments show that state-of-the-art models still struggle with multilingual generalization and long-context understanding, leaving substantial room for improvement and indicating that V2 discriminates between models better than V1. ViDoRe V2 is designed as a living benchmark that invites community contributions to stay relevant.
📝 Abstract
The ViDoRe Benchmark V1 was approaching saturation, with top models exceeding 90% nDCG@5, which limited its ability to discern improvements. ViDoRe Benchmark V2 introduces realistic, challenging retrieval scenarios via blind contextual querying, long and cross-document queries, and a hybrid synthetic and human-in-the-loop query generation process. It comprises four diverse, multilingual datasets and provides clear evaluation instructions. Initial results demonstrate substantial room for advancement and offer insights into model generalization and multilingual capability. This benchmark is designed as a living resource, inviting community contributions to maintain relevance through future evaluations.
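For readers unfamiliar with the metric quoted above, nDCG@5 for a single query can be sketched as follows. This is a minimal illustration, not the benchmark's official evaluation code: it assumes linear gain and a log2 rank discount, and the function names `dcg_at_k` and `ndcg_at_k` are hypothetical helpers introduced here.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results.

    Assumes linear gain (the relevance label itself) and a log2 discount.
    """
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    judged_relevances: Sequence[float],
    k: int = 5,
) -> float:
    """nDCG@k for one query.

    ranked_relevances: relevance labels of the retrieved pages, in the
        order the model ranked them.
    judged_relevances: relevance labels of every judged page for the
        query, used to build the ideal ranking.
    """
    ideal = sorted(judged_relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    if idcg == 0:
        return 0.0  # no relevant page exists for this query
    return dcg_at_k(ranked_relevances, k) / idcg


# Hypothetical query with two relevant pages, retrieved at ranks 2 and 5.
ranked = [0, 1, 0, 0, 1]
judged = [1, 1, 0, 0, 0, 0]
score = ndcg_at_k(ranked, judged, k=5)
print(f"nDCG@5 = {score:.3f}")
```

A benchmark-level score is typically the mean of this per-query value over all queries, so a model exceeding 90% places relevant pages near the top of the ranking for most queries.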