🤖 AI Summary
Existing RAG benchmarks are heavily English-centric, with few dynamic, updatable evaluation resources for non-English languages such as Russian. Method: The authors introduce DRAGON, the first dynamic RAG benchmark for Russian, built on a continuously crawled and updated corpus of news articles and public documents. It leverages knowledge graphs to automatically generate questions of four types (factual queries, temporal reasoning, multi-hop inference, and contextual evolution), thereby modeling the dynamics of real-world information retrieval. Contribution/Results: DRAGON provides an end-to-end evaluation pipeline with multidimensional metrics (e.g., retrieval quality, generation faithfulness, temporal sensitivity), a fully open-sourced framework, toolchain, and dataset, and a continuously maintained public leaderboard. It fills a gap in non-English dynamic RAG evaluation and offers a reusable template for assessing the robustness of RAG systems over evolving corpora.
📝 Abstract
Retrieval-Augmented Generation (RAG) is a widely adopted approach for improving the factuality of large language models (LLMs) by incorporating external knowledge at inference time. Although there exist multiple RAG benchmarks for English, evaluation resources for other languages, including Russian, remain scarce and static, failing to capture the dynamic nature of real-world deployments.
In this work, we present DRAGON (Dynamic RAG Benchmark On News), the first dynamic benchmark for evaluating RAG systems in Russian on a changing news corpus. DRAGON is built upon a regularly updated corpus of Russian news and public documents and supports comprehensive evaluation of both the retriever and generator components. Questions are generated automatically using a Knowledge Graph constructed from the corpus, enabling the extraction of four core question types aligned with distinct subgraph patterns. We release a complete evaluation framework comprising the automatic question-generation pipeline, evaluation scripts (potentially reusable for other languages and multilingual settings), and benchmark data. We also launch a public leaderboard to encourage community participation and comparison.
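The abstract does not spell out how subgraph patterns map to question types, so the sketch below is a hypothetical illustration of the general idea: extract a small subgraph of (subject, relation, object) triples from the knowledge graph, then classify its shape into one of the four question types named in the summary. The `Triple` class, the `date_` relation prefix, and the classification heuristics are all assumptions for illustration, not the DRAGON implementation.

```python
# Hypothetical sketch: classifying a knowledge-graph subgraph into one of
# the four question types (factual, temporal, multi-hop, contextual).
# All names and heuristics here are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str


def classify_subgraph(triples: list[Triple]) -> str:
    """Assign a question type to a subgraph based on its pattern."""
    relations = {t.relation for t in triples}
    if len(triples) == 1:
        return "factual"        # a single fact -> simple factual question
    if any(r.startswith("date_") for r in relations):
        return "temporal"       # time-anchored facts -> temporal reasoning
    subjects = {t.subject for t in triples}
    objects = {t.obj for t in triples}
    if subjects & objects:
        return "multi-hop"      # chained entities -> multi-hop inference
    return "contextual"         # disjoint facts about evolving context


# Example: a chain A -> B -> C yields a multi-hop question pattern.
chain = [Triple("A", "works_at", "B"), Triple("B", "located_in", "C")]
print(classify_subgraph(chain))  # -> multi-hop
```

Each detected pattern would then be filled into a question template and paired with its supporting documents as ground truth for both retrieval and generation evaluation.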