PreprintToPaper dataset: connecting bioRxiv preprints with journal publications

📅 2025-10-02

🤖 AI Summary

This study examines how bioRxiv preprints transition to journal publication. To handle preprints whose publication status is ambiguous or unverifiable, it sorts records into three categories: “Published” (formally linked to a journal article), “Preprint Only” (unpublished), and “Gray Zone” (potentially published but not linked). Using metadata from the bioRxiv and Crossref APIs, the authors compute title and author similarity scores and provide a human-annotated subset of 299 records for evaluating Gray Zone cases. The resulting dataset covers 145,517 preprints from two periods, 2016-2018 and 2020-2022. It is publicly available in CSV format via Zenodo and supports research on scholarly communication, open science policies, bibliometric tool development, and NLP studies of textual changes between preprints and their published versions.

📝 Abstract

The PreprintToPaper dataset connects bioRxiv preprints with their corresponding journal publications, enabling large-scale analysis of the preprint-to-publication process. It comprises metadata for 145,517 preprints from two periods, 2016-2018 (pre-pandemic) and 2020-2022 (pandemic), retrieved via the bioRxiv and Crossref APIs. Each record includes bibliographic information such as titles, abstracts, authors, institutions, submission dates, licenses, and subject categories, alongside enriched publication metadata including journal names, publication dates, author lists, and further information. Preprints are categorized into three groups: Published (formally linked to a journal article), Preprint Only (unpublished), and Gray Zone (potentially published but unlinked). To enhance reliability, title and author similarity scores were calculated, and a human-annotated subset of 299 records was created for evaluation of Gray Zone cases. The dataset supports diverse applications, including studies of scholarly communication, open science policies, bibliometric tool development, and natural language processing research on textual changes between preprints and their published versions. The dataset is publicly available in CSV format via Zenodo.
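The abstract mentions title and author similarity scores but does not say how they are computed. As a rough illustration only, the sketch below uses Python's `difflib.SequenceMatcher` ratio for titles and a Jaccard overlap of surnames for authors. Both measures, the function names, and the "last token is the surname" rule are assumptions made here, not the dataset's documented method.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def title_similarity(preprint_title: str, paper_title: str) -> float:
    """Character-level similarity in [0, 1] (stand-in measure, an assumption)."""
    return SequenceMatcher(
        None, normalize(preprint_title), normalize(paper_title)
    ).ratio()


def author_similarity(preprint_authors, paper_authors) -> float:
    """Jaccard overlap of surnames.

    Assumes names are written "Given Family", so the last token is
    treated as the surname (a simplification).
    """
    a = {normalize(n).split()[-1] for n in preprint_authors if normalize(n)}
    b = {normalize(n).split()[-1] for n in paper_authors if normalize(n)}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A score near 1.0 on both measures suggests the two records describe the same work; where to draw a cutoff is a separate choice that would need checking against annotated examples such as the 299-record subset.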

Problem

Research questions and friction points this paper is trying to address.

Linking bioRxiv preprints with corresponding journal publications

Enabling large-scale analysis of preprint-to-publication process

Categorizing preprints into Published, Preprint Only, and Gray Zone (potentially published but unlinked) groups

Innovation

Methods, ideas, or system contributions that make the work stand out.

Links preprints to publications via the bioRxiv and Crossref APIs

Categorizes records into three publication statuses

Calculates similarity scores for reliability enhancement
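The three publication statuses can be pictured as a simple decision rule. The sketch below is hypothetical: the parameter names, the treatment of an empty or "NA" link as missing, and the 0.9 threshold are all assumptions for illustration, and the source does not state how Gray Zone candidates are actually identified.

```python
from typing import Optional


def categorize(
    linked_doi: Optional[str],
    best_candidate_score: Optional[float],
    threshold: float = 0.9,  # illustrative cutoff, not taken from the paper
) -> str:
    """Assign one of the three statuses named in the abstract.

    linked_doi: journal DOI formally linked to the preprint, if any
        (hypothetical field; empty or "NA" is treated as no link).
    best_candidate_score: similarity to the best unlinked journal
        candidate, if one was found (hypothetical field).
    """
    if linked_doi and linked_doi.strip().upper() != "NA":
        return "Published"
    if best_candidate_score is not None and best_candidate_score >= threshold:
        return "Gray Zone"
    return "Preprint Only"
```

For example, `categorize(None, 0.95)` yields "Gray Zone" because a strong candidate exists without a formal link, while `categorize(None, None)` yields "Preprint Only".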
