🤖 AI Summary
This work addresses a critical gap in academic information extraction by shifting focus from scholarly papers to implementation-level research artifacts in code repositories. We formally define and annotate ten categories of implementation-level entities within README files, introducing NERdME, a high-quality named entity recognition dataset comprising 200 expert-annotated READMEs and over 10,000 entity spans. Leveraging this resource, we conduct baseline experiments with large language models and fine-tuned Transformers, revealing substantial differences between implementation-level and paper-level entities. Furthermore, we demonstrate the practical utility of our approach through downstream entity linking tasks, showing its effectiveness in research artifact discovery and metadata integration. This study thus establishes the first systematic framework for semantic information extraction from code repositories, filling a longstanding void in academic knowledge extraction.
📄 Abstract
Existing scholarly information extraction (SIE) datasets focus on scientific papers and overlook implementation-level details in code repositories. README files describe datasets, source code, and other implementation-level artifacts; however, their free-form Markdown offers little semantic structure, making automatic information extraction difficult. To address this gap, NERdME is introduced: 200 manually annotated README files with over 10,000 labeled spans across 10 entity types. Baseline results using large language models and fine-tuned Transformers show clear differences between paper-level and implementation-level entities, indicating the value of extending SIE benchmarks with entity types available in README files. A downstream entity-linking experiment demonstrates that entities derived from READMEs can support artifact discovery and metadata integration.
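To make the annotation setup concrete, the sketch below shows one common way such span-level NER data is prepared for token classification: character-level entity spans are projected onto whitespace tokens as BIO tags. This is a minimal illustration, not the dataset's actual preprocessing; the entity type names ("Dataset", "Language") and the example sentence are hypothetical.

```python
# Illustrative sketch: convert character-level entity spans, as in a
# README NER dataset, into token-level BIO tags for model training.
# The label names used below are hypothetical, not NERdME's label set.

def bio_tags(text, spans):
    """Whitespace-tokenize `text` and assign BIO tags from (start, end, label) spans."""
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)          # character offset of this token
        tokens.append(tok)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        # tokens fully covered by the annotated span
        inside = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end]
        for j, i in enumerate(inside):
            tags[i] = ("B-" if j == 0 else "I-") + label
    return list(zip(tokens, tags))

text = "Trained on the SQuAD dataset using Python 3.10"
spans = [(15, 20, "Dataset"), (35, 41, "Language")]
print(bio_tags(text, spans))
# e.g. ('SQuAD', 'B-Dataset') and ('Python', 'B-Language'), all other tokens 'O'
```

In practice, subword tokenizers require an extra alignment step (propagating each word's tag to its first subword), but the span-to-tag projection follows the same logic.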