AI Summary
Accurately identifying AI-related scholarly publications across disciplines and over extended time periods (1956–2024) remains challenging due to heterogeneous terminology, evolving definitions, and sparse domain-specific labeled data.
Method: We introduce the first dedicated annotation dataset for AI literature identification and propose an LSTM-based binary classifier that jointly encodes metadata (e.g., venue, year, authors) and textual features from titles and abstracts. The model is explicitly designed for temporal robustness and cross-domain generalizability.
Contribution/Results: Our approach enables scalable, high-precision filtering of AI literature, yielding DeepDiveAI, the largest publicly available AI literature dataset to date, comprising 9.4 million papers. Rigorous evaluation demonstrates accuracy and F1-score above 96%. DeepDiveAI provides a foundational, high-quality resource for AI historiography, disciplinary evolution analysis, and evidence-based science policy research.
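To make the method description above more concrete, here is a minimal sketch of an LSTM binary classifier that reads metadata and text as one sequence. It is not the authors' implementation. The way metadata is folded in (prefix tokens such as `venue=...`), the hashed vocabulary, the layer sizes, and every function name (`build_tokens`, `init_params`, `lstm_step`, `predict_ai_probability`) are assumptions made for illustration, and the weights are random rather than trained.

```python
import math
import random
import zlib

HIDDEN = 8      # hidden-state size (illustrative assumption)
EMBED = 8       # embedding size (illustrative assumption)
BUCKETS = 512   # hashed vocabulary size (illustrative assumption)

def build_tokens(record):
    """Flatten metadata and text into one token sequence (assumed scheme)."""
    meta = [f"venue={record['venue'].lower()}", f"year={record['year']}"]
    meta += [f"author={name.lower()}" for name in record.get("authors", [])]
    text = (record["title"] + " " + record["abstract"]).lower().split()
    return meta + text

def token_id(token):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(token.encode("utf-8")) % BUCKETS

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_params(seed=0):
    """Random, untrained parameters; a real model would learn these."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "embed": mat(BUCKETS, EMBED),
        # One weight matrix per gate, applied to the concatenation [x_t ; h_{t-1}].
        "gates": {g: mat(HIDDEN, EMBED + HIDDEN) for g in "ifog"},
        "bias": {g: [0.0] * HIDDEN for g in "ifog"},
        "out_w": [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)],
        "out_b": 0.0,
    }

def lstm_step(params, x, h, c):
    """One standard LSTM cell update."""
    z = x + h  # list concatenation of input and previous hidden state
    def gate(name, act):
        return [act(sum(w * v for w, v in zip(row, z)) + b)
                for row, b in zip(params["gates"][name], params["bias"][name])]
    i = gate("i", sigmoid)     # input gate
    f = gate("f", sigmoid)     # forget gate
    o = gate("o", sigmoid)     # output gate
    g = gate("g", math.tanh)   # candidate cell state
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

def predict_ai_probability(params, record):
    """Run the LSTM over the sequence and map the final state to P(AI-related)."""
    h = [0.0] * HIDDEN
    c = [0.0] * HIDDEN
    for tok in build_tokens(record):
        h, c = lstm_step(params, params["embed"][token_id(tok)], h, c)
    logit = sum(w * v for w, v in zip(params["out_w"], h)) + params["out_b"]
    return sigmoid(logit)

# Hypothetical record, for illustration only.
example = {
    "venue": "Example Conference",
    "year": 2020,
    "authors": ["A. Author"],
    "title": "Neural networks for image recognition",
    "abstract": "We train a deep model on labeled images.",
}
probability = predict_ai_probability(init_params(seed=0), example)
```

Because the parameters are untrained, `probability` carries no meaning here; the sketch only shows how a single recurrent pass can consume metadata tokens and text tokens together before a sigmoid output decides between AI-related and non-AI-related.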
Abstract
This paper presents DeepDiveAI, a comprehensive dataset specifically curated to identify AI-related research papers from a large-scale academic literature database. The dataset was created using a Long Short-Term Memory (LSTM) model trained on a binary classification task to distinguish between AI-related and non-AI-related papers. The model was trained and validated on a large dataset, achieving high accuracy, precision, recall, and F1-score. The resulting DeepDelveAI dataset comprises over 9.4 million AI-related papers published from 1956, the year of the Dartmouth Conference, to 2024, providing a crucial resource for analyzing trends, thematic developments, and the evolution of AI research across various disciplines.
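For readers unfamiliar with the four metrics named above, the following sketch shows how they are conventionally computed for a binary classifier from true and predicted labels. The function name `binary_metrics` and the toy labels are illustrative assumptions, not the paper's evaluation code or data.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels in {0, 1} (1 = AI-related)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels, made up for illustration.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
toy_scores = binary_metrics(toy_true, toy_pred)
```

Reporting precision and recall alongside accuracy matters in this setting because AI-related papers are presumably a minority of a general literature database, so accuracy alone could look high even for a classifier that misses many of them.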