🤖 AI Summary
Existing code information retrieval (IR) research suffers from a lack of comprehensive benchmarks, task monotony, and insufficient domain coverage, hindering holistic evaluation of model capabilities. To address this, we introduce CoIR—the first comprehensive, multi-task, cross-domain benchmark for code IR—encompassing 10 datasets spanning 8 distinct retrieval tasks across 7 domains. CoIR systematically formalizes diversity-oriented evaluation dimensions for code retrieval and ensures full compatibility with the MTEB and BEIR standards. We release an open-source, pip-installable Python framework enabling fair, reproducible evaluation. Standardized assessment of nine state-of-the-art IR models reveals substantial performance degradation on code-specific tasks compared to natural-language IR. CoIR has been widely adopted by the community, catalyzing the development of multiple novel code IR models and bridging a critical gap in extending general-purpose IR methodologies to code.
📝 Abstract
Despite the substantial success of Information Retrieval (IR) in various NLP tasks, most IR systems predominantly handle queries and corpora in natural language, neglecting the domain of code retrieval. Code retrieval is critically important yet remains under-explored, with existing methods and benchmarks inadequately representing the diversity of code across domains and tasks. Addressing this gap, we present COIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities. COIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains. We first discuss the construction of COIR and its diverse dataset composition. We then evaluate nine widely used retrieval models using COIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems. To facilitate easy adoption and integration within existing research workflows, COIR has been developed as a user-friendly Python framework, readily installable via pip. It shares the same data schema as other popular benchmarks such as MTEB and BEIR, enabling seamless cross-benchmark evaluations. Through COIR, we aim to invigorate research in the code retrieval domain, providing a versatile benchmarking tool that encourages further development and exploration of code retrieval systems (https://github.com/CoIR-team/coir).
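To illustrate the intended workflow of the pip-installable framework, here is a minimal usage sketch. It assumes the package is published as `coir-eval`, exposes a `coir.get_tasks` helper and a `COIR` evaluation class, and accepts a dense-embedding model wrapper in the MTEB/BEIR style; treat the exact names, task identifiers, and signatures as illustrative and consult the repository README for the canonical API.

```python
# Hypothetical usage sketch (package, class, and task names assumed;
# see https://github.com/CoIR-team/coir for the canonical API).
# Install first:  pip install coir-eval

import coir                                  # assumed top-level package of the CoIR framework
from coir.evaluation import COIR             # assumed evaluation entry point
from coir.models import YourCustomDEModel    # assumed dense-embedding model wrapper

# Wrap a HuggingFace embedding model behind the benchmark's model interface.
model = YourCustomDEModel(model_name="intfloat/e5-base-v2")

# Select one or more of the benchmark's retrieval tasks by name (identifier assumed).
tasks = coir.get_tasks(tasks=["codetrans-dl"])

# Run the standardized evaluation; outputs follow the shared MTEB/BEIR data schema.
evaluation = COIR(tasks=tasks, batch_size=128)
results = evaluation.run(model, output_folder="results/e5-base-v2")
print(results)
```

Because the data schema matches MTEB and BEIR, a model wrapper written for either of those benchmarks should, in principle, plug into this loop with little or no modification.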