🤖 AI Summary
Existing IR evaluation relies heavily on static, manually annotated benchmarks, which are costly to build and slow to extend to emerging domains and multilingual scenarios. This paper introduces the first LLM-driven, automated, heterogeneous, and dynamically evolving IR evaluation benchmark. The method employs a prompt-engineering–guided data synthesis paradigm, grounded in real-world corpora, to enable on-demand generation of cross-domain, cross-lingual, and multi-task test collections. A multi-dimensional quality verification and alignment evaluation framework ensures high fidelity between synthetic and human annotations (average relevance > 0.92). The benchmark supports low-cost, rapid scaling and unified modeling across diverse IR tasks. All resources, including prompts, synthetic datasets, and evaluation tools, are fully open-sourced and have been adopted as standard evaluation infrastructure by multiple leading IR research teams.
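The synthesize-then-verify loop described above can be sketched in miniature. This is an illustrative assumption, not the paper's actual pipeline: the LLM call is stubbed out (`synthesize_query` just reuses leading tokens of the document), and the quality check is a simple round-trip consistency filter with a token-overlap retriever standing in for a real embedding model.

```python
from collections import Counter

def overlap_score(query: str, doc: str) -> float:
    """Toy relevance score: multiset token overlap between query and document."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def synthesize_query(doc: str) -> str:
    """Stand-in for the LLM prompt that writes a query for `doc`.
    Here we fake generation by reusing the document's first four tokens."""
    return " ".join(doc.split()[:4])

def consistency_filter(corpus: list[str], top_k: int = 1) -> list[tuple[str, int]]:
    """Round-trip quality check: keep a synthetic query only if its source
    document ranks in the top-k of the corpus under the scoring function."""
    kept = []
    for idx, doc in enumerate(corpus):
        query = synthesize_query(doc)
        ranked = sorted(range(len(corpus)),
                        key=lambda j: overlap_score(query, corpus[j]),
                        reverse=True)
        if idx in ranked[:top_k]:
            kept.append((query, idx))
    return kept

corpus = [
    "neural retrieval models rank documents by embedding similarity",
    "quantum computing uses qubits for parallel computation",
]
pairs = consistency_filter(corpus)
```

The round-trip check is the key design idea: a generated query is only trusted as a test item if a retriever can recover the document it was generated from, which filters out off-topic or underspecified generations without any human labeling.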
📝 Abstract
Evaluation plays a crucial role in the advancement of information retrieval (IR) models. However, current benchmarks, which are based on predefined domains and human-labeled data, face limitations in addressing evaluation needs for emerging domains both cost-effectively and efficiently. To address this challenge, we propose the Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1) Automated: the testing data in AIR-Bench is automatically generated by large language models (LLMs) without human intervention. 2) Heterogeneous: the testing data in AIR-Bench is generated with respect to diverse tasks, domains, and languages. 3) Dynamic: the domains and languages covered by AIR-Bench are constantly augmented to provide an increasingly comprehensive evaluation benchmark for community developers. We develop a reliable and robust data generation pipeline to automatically create diverse and high-quality evaluation datasets based on real-world corpora. Our findings demonstrate that the generated testing data in AIR-Bench aligns well with human-labeled testing data, making AIR-Bench a dependable benchmark for evaluating IR models. The resources in AIR-Bench are publicly available at https://github.com/AIR-Bench/AIR-Bench.
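One common way to read "aligns well with human-labeled testing data" is as a rank-correlation check: score a set of retrievers on both the synthetic and the human-annotated collection, then compare the two induced model orderings. A minimal sketch using Spearman's rho follows; the score values are hypothetical placeholders, not results from the paper, and ties are assumed absent so the classic closed-form formula applies.

```python
def spearman_rho(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation between two paired score lists.
    Assumes no tied scores, so rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    def ranks(vals: list[float]) -> list[int]:
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical nDCG scores for four retrievers on each collection.
human_labeled = [0.41, 0.55, 0.62, 0.48]
synthetic     = [0.39, 0.57, 0.60, 0.50]
rho = spearman_rho(human_labeled, synthetic)  # 1.0: identical model ordering
```

A rho near 1.0 means the synthetic benchmark ranks systems the same way the human-labeled one does, which is exactly the property a drop-in replacement benchmark needs even if absolute scores differ.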