PreprintToPaper dataset: connecting bioRxiv preprints with journal publications

📅 2025-10-02

🤖 AI Summary
This study builds a dataset that links bioRxiv preprints to their corresponding journal publications. To handle preprints whose publication status is ambiguous, it sorts records into three categories: “Published” (formally linked to a journal article), “Preprint Only” (unpublished), and “Gray Zone” (potentially published but unlinked). Using metadata retrieved via the bioRxiv and Crossref APIs, the authors compute title and author similarity scores and provide a human-annotated subset of 299 records for evaluating Gray Zone cases. The resulting dataset covers 145,517 preprints from two periods, 2016-2018 and 2020-2022. It is publicly available in CSV format via Zenodo and supports studies of scholarly communication, open science policies, bibliometric tool development, and NLP research on textual changes between preprints and their published versions.

📝 Abstract
The PreprintToPaper dataset connects bioRxiv preprints with their corresponding journal publications, enabling large-scale analysis of the preprint-to-publication process. It comprises metadata for 145,517 preprints from two periods, 2016-2018 (pre-pandemic) and 2020-2022 (pandemic), retrieved via the bioRxiv and Crossref APIs. Each record includes bibliographic information such as titles, abstracts, authors, institutions, submission dates, licenses, and subject categories, alongside enriched publication metadata including journal names, publication dates, author lists, and further information. Preprints are categorized into three groups: Published (formally linked to a journal article), Preprint Only (unpublished), and Gray Zone (potentially published but unlinked). To enhance reliability, title and author similarity scores were calculated, and a human-annotated subset of 299 records was created for evaluation of Gray Zone cases. The dataset supports diverse applications, including studies of scholarly communication, open science policies, bibliometric tool development, and natural language processing research on textual changes between preprints and their published versions. The dataset is publicly available in CSV format via Zenodo.
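
The three-group categorization described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact rule used to assign records, so everything here is an assumption made for illustration: the function name `categorize`, the convention that a missing link is an empty value or the string "NA", the idea that a Gray Zone record is one with a sufficiently similar candidate article, and the 0.9 threshold.

```python
from typing import Optional


def categorize(
    linked_doi: Optional[str],
    candidate_similarity: Optional[float],
    threshold: float = 0.9,  # hypothetical cut-off, not taken from the paper
) -> str:
    """Assign one of the three categories named in the abstract.

    linked_doi: DOI of a formally linked journal article, or None/""/"NA"
        when no link is recorded (assumed convention).
    candidate_similarity: best similarity score against a candidate journal
        article found by search, or None when no candidate exists (assumed).
    """
    # A formal link to a journal article means the preprint is Published.
    if linked_doi and linked_doi.strip().upper() != "NA":
        return "Published"
    # No formal link, but a close candidate exists: potentially published.
    if candidate_similarity is not None and candidate_similarity >= threshold:
        return "Gray Zone"
    # No link and no convincing candidate.
    return "Preprint Only"


if __name__ == "__main__":
    print(categorize("10.1000/example", None))
    print(categorize("NA", 0.95))
    print(categorize(None, 0.20))
```

Putting the formal link first reflects the abstract's definition of Published as "formally linked"; the Gray Zone branch only ever applies to unlinked records.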
Problem

Research questions and friction points this paper is trying to address.

Linking bioRxiv preprints with their corresponding journal publications
Enabling large-scale analysis of the preprint-to-publication process
Categorizing preprints into Published, Preprint Only, and Gray Zone (potentially published but unlinked) groups
Innovation

Methods, ideas, or system contributions that make the work stand out.

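Links preprints to publications using metadata from the bioRxiv and Crossref APIs
Categorizes records into three publication statuses: Published, Preprint Only, and Gray Zone
Calculates title and author similarity scores to make the links more reliable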
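
As a rough illustration of the similarity scores mentioned above, the sketch below compares a preprint with a candidate journal article. The source does not say which similarity measures the authors used, so both choices here are assumptions: a character-level ratio from Python's standard `difflib.SequenceMatcher` for titles, and a Jaccard overlap of normalized names for author lists. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def title_similarity(preprint_title: str, article_title: str) -> float:
    """Return a score in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(
        None, normalize(preprint_title), normalize(article_title)
    ).ratio()


def author_similarity(
    preprint_authors: Iterable[str], article_authors: Iterable[str]
) -> float:
    """Jaccard overlap of normalized author names (assumed measure)."""
    set_a = {normalize(name) for name in preprint_authors if name.strip()}
    set_b = {normalize(name) for name in article_authors if name.strip()}
    if not set_a and not set_b:
        return 0.0  # nothing to compare
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    print(title_similarity("A  Study of X", "a study of x"))
    print(author_similarity(["Smith", "Lee"], ["lee", "Kim"]))
```

Titles often change slightly between preprint and publication, so a graded score is more useful than an exact match; the same reasoning applies to author lists, which can gain or lose names during review.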