🤖 AI Summary
This study introduces PreprintToPaper, a dataset that links bioRxiv preprints to their corresponding journal publications. To handle ambiguous publication statuses, it sorts preprints into three categories: “Published” (formally linked to a journal article), “Preprint Only” (unpublished), and “Gray Zone” (potentially published but not linked). Using metadata retrieved from the bioRxiv and Crossref APIs, the authors compute title and author similarity scores and provide a human-annotated subset of 299 records for evaluating Gray Zone cases. The resulting dataset covers 145,517 preprints from 2016-2018 and 2020-2022. It is publicly available in CSV format via Zenodo and supports applications including studies of scholarly communication, open science policy analysis, bibliometric tool development, and NLP research on textual changes between preprints and their published versions.
📝 Abstract
The PreprintToPaper dataset connects bioRxiv preprints with their corresponding journal publications, enabling large-scale analysis of the preprint-to-publication process. It comprises metadata for 145,517 preprints from two periods, 2016-2018 (pre-pandemic) and 2020-2022 (pandemic), retrieved via the bioRxiv and Crossref APIs. Each record includes bibliographic information such as titles, abstracts, authors, institutions, submission dates, licenses, and subject categories, alongside enriched publication metadata including journal names, publication dates, author lists, and further information. Preprints are categorized into three groups: Published (formally linked to a journal article), Preprint Only (unpublished), and Gray Zone (potentially published but unlinked). To enhance reliability, title and author similarity scores were calculated, and a human-annotated subset of 299 records was created for evaluation of Gray Zone cases. The dataset supports diverse applications, including studies of scholarly communication, open science policies, bibliometric tool development, and natural language processing research on textual changes between preprints and their published versions. The dataset is publicly available in CSV format via Zenodo.
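As a rough illustration of the similarity scoring and three-way categorization described above, the following Python sketch may help. It is not the authors' implementation: the abstract does not say which similarity measures or decision rules were used, so the character-level ratio from `difflib`, the Jaccard overlap for authors, and the function names and parameters (`title_similarity`, `author_similarity`, `categorize`, `linked_doi`, `candidate_found`) are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def title_similarity(preprint_title: str, journal_title: str) -> float:
    """Character-level similarity in [0, 1] (assumed measure, not the paper's)."""
    return SequenceMatcher(
        None, normalize(preprint_title), normalize(journal_title)
    ).ratio()


def author_similarity(
    preprint_authors: Sequence[str], journal_authors: Sequence[str]
) -> float:
    """Jaccard overlap of normalized author names (assumed measure)."""
    a = {normalize(name) for name in preprint_authors}
    b = {normalize(name) for name in journal_authors}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def categorize(linked_doi: Optional[str], candidate_found: bool) -> str:
    """Assign one of the three categories named in the abstract.

    Hypothetical rule: a formal link means Published; no link but a
    plausible journal candidate means Gray Zone; otherwise Preprint Only.
    """
    if linked_doi:
        return "Published"
    if candidate_found:
        return "Gray Zone"
    return "Preprint Only"
```

In a setting like this, the two scores would typically be compared against a threshold to decide whether an unlinked candidate is plausible; any such threshold would need to be tuned against annotated records such as the 299-record subset.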