🤖 AI Summary
This study investigates the interdisciplinary evolution of scientific knowledge, motivated by global challenges such as pandemics, climate change, and AI ethics. To this end, we construct SciEvo, a large-scale bibliometric dataset of over 2 million scholarly articles spanning more than 30 years. It integrates full-text content with a temporal citation network, enabling analysis of term evolution, citation dynamics, and cross-domain knowledge flow. Methodologically, we propose a systematic framework for metadata cleaning, structured text extraction, and time-sliced citation network modeling; the dataset is publicly released on GitHub, Kaggle, and Hugging Face. Empirical analysis reveals structural disparities across disciplines in knowledge production tempo (e.g., mean citation age: 2.48 years in LLM research vs. 9.71 years in oral history), terminology evolution rates, and depth of knowledge integration. The work delivers a reproducible, scalable benchmark dataset and analytical paradigm for science policy, interdisciplinary evaluation, and AI-augmented research.
📝 Abstract
Understanding the creation, evolution, and dissemination of scientific knowledge is crucial for bridging diverse subject areas and addressing complex global challenges such as pandemics, climate change, and ethical AI. Scientometrics, the quantitative and qualitative study of scientific literature, provides valuable insights into these processes. We introduce SciEvo, a longitudinal scientometric dataset of over two million academic publications, providing comprehensive content information and citation graphs to support cross-disciplinary analyses. SciEvo is easy to use and available across platforms, including GitHub, Kaggle, and Hugging Face. Using SciEvo, we conduct a temporal study spanning over 30 years to explore key questions in scientometrics: the evolution of academic terminology, citation patterns, and interdisciplinary knowledge exchange. Our findings reveal disparities in epistemic cultures, knowledge production modes, and citation practices across fields. For example, rapidly developing, application-driven fields like LLMs exhibit a significantly shorter citation age (2.48 years) than traditional theoretical disciplines like oral history (9.71 years). Our data and analytic tools can be accessed at https://github.com/Ahren09/SciEvo.
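To make the citation-age comparison concrete, here is a minimal sketch of how a mean citation age could be computed. It assumes citation age is the difference in publication year between a citing paper and the paper it cites; the abstract does not state the exact definition, so this is an assumption. The function name `mean_citation_age` and the toy year pairs are hypothetical and are not drawn from SciEvo.

```python
from statistics import mean


def mean_citation_age(citations):
    """Return the mean age, in years, of the references in `citations`.

    `citations` is an iterable of (citing_year, cited_year) pairs.
    Assumed definition: age = citing_year - cited_year.
    Pairs where the cited paper is newer than the citing paper are
    skipped, since they usually indicate metadata errors.
    """
    ages = [citing - cited for citing, cited in citations if citing >= cited]
    if not ages:
        raise ValueError("no valid citation pairs")
    return mean(ages)


# Hypothetical toy data for illustration only (not taken from SciEvo).
fast_field = [(2023, 2022), (2023, 2021), (2024, 2020)]
slow_field = [(2023, 2010), (2023, 2015), (2024, 2016)]

print(f"fast-moving field: {mean_citation_age(fast_field):.2f} years")
print(f"slow-moving field: {mean_citation_age(slow_field):.2f} years")
```

Under this definition, a field that mostly cites recent work yields a small mean, while a field that draws on older literature yields a large one, which is the kind of contrast the abstract reports between LLM research and oral history.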