The Semantic Scholar Open Data Platform

📅 2023
🏛️ arXiv.org
📈 Citations: 87
Influential: 12
🤖 AI Summary
As the scientific literature grows rapidly, researchers need efficient tools for discovering and understanding papers. This paper describes how Semantic Scholar builds its open academic knowledge graph: a data processing pipeline that combines public and proprietary data sources, scholarly PDF content extraction, and automatic knowledge graph construction, and adds semantic features such as structurally parsed text, natural language summaries, and vector embeddings. The resulting Semantic Scholar Academic Graph, described as the largest open scientific literature graph to date, contains more than 200 million papers, 80 million authors, and 2.4 billion citation edges. The graph is released publicly together with APIs, and the paper is maintained as a living document that is updated as data offerings and services change.
📝 Abstract
The volume of scientific output is creating an urgent need for automated tools to help scientists keep up with developments in their field. Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction to build the Semantic Scholar Academic Graph, the largest open scientific literature graph to date, with 200M+ papers, 80M+ authors, 550M+ paper-authorship edges, and 2.4B+ citation edges. The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings. In this paper, we describe the components of the S2 data processing pipeline and the associated APIs offered by the platform. We will update this living document to reflect changes as we add new data offerings and improve existing services.
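The abstract names two node types (papers, authors) and two edge types (paper-authorship, citation). A minimal in-memory sketch of that shape may help make it concrete. Everything here is hypothetical and for illustration only: the class name `AcademicGraph`, its methods, and the IDs are assumptions, not the platform's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set, Tuple


@dataclass
class AcademicGraph:
    """Toy graph with the node and edge types named in the abstract (hypothetical schema)."""

    papers: Dict[str, str] = field(default_factory=dict)    # paper_id -> title
    authors: Dict[str, str] = field(default_factory=dict)   # author_id -> name
    authorship: Set[Tuple[str, str]] = field(default_factory=set)  # (paper_id, author_id)
    citations: Set[Tuple[str, str]] = field(default_factory=set)   # (citing_id, cited_id)

    def add_author(self, author_id: str, name: str) -> None:
        self.authors[author_id] = name

    def add_paper(self, paper_id: str, title: str, author_ids: Iterable[str] = ()) -> None:
        self.papers[paper_id] = title
        # One paper-authorship edge per listed author.
        for author_id in author_ids:
            self.authorship.add((paper_id, author_id))

    def add_citation(self, citing_id: str, cited_id: str) -> None:
        # Directed edge: citing paper -> cited paper.
        self.citations.add((citing_id, cited_id))

    def citation_count(self, paper_id: str) -> int:
        # Number of incoming citation edges.
        return sum(1 for _, cited in self.citations if cited == paper_id)

    def papers_by(self, author_id: str) -> List[str]:
        return sorted(p for p, a in self.authorship if a == author_id)


# Made-up example data, not drawn from the real graph.
graph = AcademicGraph()
graph.add_author("a1", "Ada Example")
graph.add_paper("p1", "A cited paper", ["a1"])
graph.add_paper("p2", "A citing paper", ["a1"])
graph.add_citation("p2", "p1")
```

At the scale reported in the abstract, sets of tuples in memory would not be practical; the sketch only illustrates which entities are nodes and which relations are edges.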
Problem

Research questions and friction points this paper is trying to address.

Automated tools needed to manage growing scientific literature volume
Semantic Scholar accelerates science by enhancing literature discovery
Building large open academic graph with advanced semantic features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated scholarly PDF content extraction
Automatic knowledge graph construction
Advanced semantic features integration
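One of the semantic features mentioned is vector embeddings. The source does not say how they are compared or used; a common choice, assumed here purely for illustration, is to rank papers by cosine similarity between their vectors. The function names and the three-dimensional toy vectors below are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_id: str, embeddings: Dict[str, Sequence[float]]) -> List[str]:
    """Return the other paper IDs, most similar to the query paper first."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vec), paper_id)
        for paper_id, vec in embeddings.items()
        if paper_id != query_id
    ]
    scored.sort(reverse=True)
    return [paper_id for _, paper_id in scored]


# Made-up toy vectors; real paper embeddings have far more dimensions.
toy_embeddings = {
    "p1": [1.0, 0.0, 0.0],
    "p2": [0.9, 0.1, 0.0],
    "p3": [0.0, 1.0, 0.0],
}
ranking = most_similar("p1", toy_embeddings)
```

In this toy data, "p2" points in nearly the same direction as "p1" and so ranks ahead of "p3", which is orthogonal to it.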
Rodney Michael Kinney
Allen Institute for Artificial Intelligence
Chloe Anastasiades
Allen Institute for Artificial Intelligence
Russell Authur
Allen Institute for Artificial Intelligence
Iz Beltagy
Allen Institute for Artificial Intelligence
Jonathan Bragg
Allen Institute for AI (AI2)
Artificial Intelligence, Human-Computer Interaction, Crowdsourcing
Alexandra Buraczynski
Allen Institute for Artificial Intelligence
Isabel Cachola
Allen Institute for Artificial Intelligence
Stefan Candra
Allen Institute for Artificial Intelligence
Yoganand Chandrasekhar
Allen Institute for Artificial Intelligence
Arman Cohan
Yale University; Allen Institute for AI
Natural Language Processing, Machine Learning, Artificial Intelligence
Miles Crawford
Allen Institute for Artificial Intelligence
Doug Downey
Allen Institute for Artificial Intelligence
Jason Dunkelberger
Semantic Scholar
Oren Etzioni
University of Washington
AI
Rob Evans
Allen Institute for Artificial Intelligence
Sergey Feldman
Allen Institute of Artificial Intelligence, Alongside Care
Machine Learning, Estimation, Pattern Recognition
Joseph Gorney
Allen Institute for Artificial Intelligence
D. Graham
Allen Institute for Artificial Intelligence
F.Q. Hu
Allen Institute for Artificial Intelligence
Regan Huff
Allen Institute for Artificial Intelligence
Daniel King
Allen Institute for Artificial Intelligence
Sebastian Kohlmeier
Allen Institute for Artificial Intelligence
Bailey Kuehl
Allen Institute for AI
Michael Langan
Allen Institute for Artificial Intelligence
Daniel Lin
Allen Institute for Artificial Intelligence
Haokun Liu
Vector Institute, University of Toronto
Natural Language Processing
Kyle Lo
Allen Institute for AI
natural language processing, machine learning, human computer interaction, statistics
Jaron Lochner
Allen Institute for Artificial Intelligence
Kelsey MacMillan
Allen Institute for Artificial Intelligence
Tyler Murray
Allen Institute for Artificial Intelligence
Christopher Newell
Allen Institute for Artificial Intelligence
Smita R Rao
Allen Institute for Artificial Intelligence
Shaurya Rohatgi
IFM, MBZUAI
Machine Learning, NLP, Information Retrieval
Paul Sayre
Allen Institute for Artificial Intelligence
Zejiang Shen
Allen Institute for Artificial Intelligence
Amanpreet Singh
Allen Institute for Artificial Intelligence
Luca Soldaini
Allen Institute for AI
Large Language Models, Open Source AI, Information Retrieval
Shivashankar Subramanian
Allen Institute for Artificial Intelligence
A. Tanaka
Allen Institute for Artificial Intelligence
Alex D Wade
Allen Institute for Artificial Intelligence
Linda M. Wagner
Allen Institute for Artificial Intelligence
Lucy Lu Wang
University of Washington; Allen Institute for AI (Ai2)
health informatics, natural language processing, science communication, open access
Christopher Wilhelm
Allen Institute for Artificial Intelligence
Caroline Wu
Allen Institute for Artificial Intelligence
Jiangjiang Yang
Allen Institute for Artificial Intelligence
Angele Zamarron
Allen Institute for Artificial Intelligence
Madeleine van Zuylen
Allen Institute for Artificial Intelligence
Daniel S. Weld
Allen Institute for Artificial Intelligence