NERdME: a Named Entity Recognition Dataset for Indexing Research Artifacts in Code Repositories

Date: 2026-03-05
Citations: 0
Influential: 0
AI Summary
This work shifts the focus of scholarly information extraction from papers to implementation-level research artifacts in code repositories. We define and annotate ten categories of implementation-level entities in README files and introduce NERdME, a named entity recognition dataset comprising 200 manually annotated READMEs and over 10,000 entity spans. Baseline experiments with large language models and fine-tuned Transformers show clear differences between implementation-level and paper-level entities. A downstream entity-linking experiment indicates that README-derived entities can support research artifact discovery and metadata integration. The study is presented as a step toward systematic semantic information extraction from code repositories.

Abstract
Existing scholarly information extraction (SIE) datasets focus on scientific papers and overlook implementation-level details in code repositories. README files describe datasets, source code, and other implementation-level artifacts; however, their free-form Markdown offers little semantic structure, making automatic information extraction difficult. To address this gap, NERdME is introduced: 200 manually annotated README files with over 10,000 labeled spans and 10 entity types. Baseline results using large language models and fine-tuned transformers show clear differences between paper-level and implementation-level entities, indicating the value of extending SIE benchmarks with entity types available in README files. A downstream entity-linking experiment was conducted to demonstrate that entities derived from READMEs can support artifact discovery and metadata integration.
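To make the notion of "labeled spans" concrete, the following is a minimal sketch of how span annotations on a README might be stored as character offsets and compared against predictions with exact-match precision, recall, and F1. It is an illustration under stated assumptions, not the NERdME format or evaluation protocol: the `Span` class, the helper functions, the example README text, and the labels `DATASET` and `SOFTWARE` are all hypothetical, and the abstract does not name the ten entity types or the metric used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset into the README text
    end: int    # exclusive character offset
    label: str  # hypothetical entity type, not NERdME's actual label set


def make_span(text: str, surface: str, label: str) -> Span:
    """Locate the first occurrence of `surface` in `text` and wrap it as a Span."""
    start = text.find(surface)
    if start < 0:
        raise ValueError(f"{surface!r} not found in text")
    return Span(start, start + len(surface), label)


def exact_match_scores(gold: set, pred: set) -> tuple:
    """Precision, recall, F1 where a prediction counts only if offsets and label all match."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical README fragment, invented for illustration.
readme = "# demo-tool\nTrained on the CoNLL-2003 corpus; requires PyTorch 2.1.\n"

gold = {
    make_span(readme, "CoNLL-2003", "DATASET"),
    make_span(readme, "PyTorch", "SOFTWARE"),
}
pred = {
    make_span(readme, "CoNLL-2003", "DATASET"),
    make_span(readme, "PyTorch 2.1", "SOFTWARE"),  # boundary differs from gold
}

precision, recall, f1 = exact_match_scores(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Under exact matching, the second prediction is counted as wrong because its end offset includes the version number. Free-form README text makes boundary decisions of this kind frequent, which is one reason span-level scores can differ between README entities and paper-level entities.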
Problem

Research questions and friction points this paper is trying to address.

Named Entity Recognition
Code Repositories
README Files
Scholarly Information Extraction
Research Artifacts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Named Entity Recognition
README Files
Research Artifacts
Scholarly Information Extraction
Implementation-level Metadata