🤖 AI Summary
Existing IR evaluation relies heavily on static, manually annotated benchmarks, which are costly to build and slow to extend to emerging domains and multilingual scenarios. This paper introduces the first LLM-driven, automated, heterogeneous, and dynamically evolving IR evaluation benchmark. The method employs a prompt-engineering–guided data synthesis paradigm, grounded in real-world corpora, to enable on-demand generation of cross-domain, cross-lingual, and multi-task test collections. A multi-dimensional quality verification and alignment evaluation framework ensures high fidelity between synthetic and human annotations (average relevance > 0.92). The benchmark supports low-cost, rapid scaling and unified modeling across diverse IR tasks. All resources, including prompts, synthetic datasets, and evaluation tools, are fully open-sourced and have been adopted as standard evaluation infrastructure by multiple leading IR research teams.
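The synthesize-then-verify loop described above can be sketched in miniature. This is an illustrative assumption, not the paper's actual pipeline: the LLM call is stubbed out (`synthesize_query` just reuses leading tokens of the document), and the quality check is a simple round-trip consistency filter with a token-overlap retriever standing in for a real embedding model.

```python
from collections import Counter

def overlap_score(query: str, doc: str) -> float:
    """Toy relevance score: multiset token overlap between query and document."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def synthesize_query(doc: str) -> str:
    """Stand-in for the LLM prompt that writes a query for `doc`.
    Here we fake generation by reusing the document's first four tokens."""
    return " ".join(doc.split()[:4])

def consistency_filter(corpus: list[str], top_k: int = 1) -> list[tuple[str, int]]:
    """Round-trip quality check: keep a synthetic query only if its source
    document ranks in the top-k of the corpus under the scoring function."""
    kept = []
    for idx, doc in enumerate(corpus):
        query = synthesize_query(doc)
        ranked = sorted(range(len(corpus)),
                        key=lambda j: overlap_score(query, corpus[j]),
                        reverse=True)
        if idx in ranked[:top_k]:
            kept.append((query, idx))
    return kept

corpus = [
    "neural retrieval models rank documents by embedding similarity",
    "quantum computing uses qubits for parallel computation",
]
pairs = consistency_filter(corpus)
```

The round-trip check is the key design idea: a generated query is only trusted as a test item if a retriever can recover the document it was generated from, which filters out off-topic or underspecified generations without any human labeling.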
📝 Abstract
Evaluation plays a crucial role in the advancement of information retrieval (IR) models. However, current benchmarks, which are based on predefined domains and human-labeled data, face limitations in addressing evaluation needs for emerging domains both cost-effectively and efficiently. To address this challenge, we propose the Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1) Automated: the testing data in AIR-Bench is automatically generated by large language models (LLMs) without human intervention. 2) Heterogeneous: the testing data in AIR-Bench is generated with respect to diverse tasks, domains, and languages. 3) Dynamic: the domains and languages covered by AIR-Bench are constantly augmented to provide an increasingly comprehensive evaluation benchmark for community developers. We develop a reliable and robust data generation pipeline to automatically create diverse and high-quality evaluation datasets based on real-world corpora. Our findings demonstrate that the generated testing data in AIR-Bench aligns well with human-labeled testing data, making AIR-Bench a dependable benchmark for evaluating IR models. The resources in AIR-Bench are publicly available at https://github.com/AIR-Bench/AIR-Bench.
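One common way to read "aligns well with human-labeled testing data" is as a rank-correlation check: score a set of retrievers on both the synthetic and the human-annotated collection, then compare the two induced model orderings. A minimal sketch using Spearman's rho follows; the score values are hypothetical placeholders, not results from the paper, and ties are assumed absent so the classic closed-form formula applies.

```python
def spearman_rho(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation between two paired score lists.
    Assumes no tied scores, so rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    def ranks(vals: list[float]) -> list[int]:
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical nDCG scores for four retrievers on each collection.
human_labeled = [0.41, 0.55, 0.62, 0.48]
synthetic     = [0.39, 0.57, 0.60, 0.50]
rho = spearman_rho(human_labeled, synthetic)  # 1.0: identical model ordering
```

A rho near 1.0 means the synthetic benchmark ranks systems the same way the human-labeled one does, which is exactly the property a drop-in replacement benchmark needs even if absolute scores differ.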