DeepDiveAI: Identifying AI Related Documents in Large Scale Literature Data

📅 2024-08-23
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
Accurately identifying AI-related scholarly publications across disciplines and over an extended time period (1956–2024) remains challenging due to heterogeneous terminology, evolving definitions, and sparse domain-specific labeled data. Method: We introduce the first dedicated annotation dataset for AI literature identification and propose an LSTM-based binary classifier that jointly encodes metadata (e.g., venue, year, authors) and textual features from titles and abstracts. The model is explicitly designed for temporal robustness and cross-domain generalizability. Contribution/Results: Our approach enables scalable, high-precision filtering of AI literature, yielding DeepDiveAI, the largest publicly available AI literature dataset to date, comprising 9.4 million papers. Rigorous evaluation demonstrates >96% accuracy and F1-score. DeepDiveAI provides a foundational, high-quality resource for AI historiography, disciplinary evolution analysis, and evidence-based science policy research.

๐Ÿ“ Abstract
This paper presents DeepDiveAI, a comprehensive dataset specifically curated to identify AI-related research papers from a large-scale academic literature database. The dataset was created using an advanced Long Short-Term Memory (LSTM) model trained on a binary classification task to distinguish between AI-related and non-AI-related papers. The model was trained and validated on a vast dataset, achieving high accuracy, precision, recall, and F1-score. The resulting DeepDelveAI dataset comprises over 9.4 million AI-related papers published since Dartmouth Conference, from 1956 to 2024, providing a crucial resource for analyzing trends, thematic developments, and the evolution of AI research across various disciplines.
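The abstract reports accuracy, precision, recall, and F1-score for the binary AI / non-AI task. As a reminder of how these four numbers relate, here is a minimal sketch in plain Python. The function name `binary_metrics`, the label convention (1 = AI-related, 0 = not AI-related), and the toy labels are assumptions made for illustration; they are not taken from the paper's evaluation code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> BinaryMetrics:
    """Compute metrics for labels where 1 = AI-related and 0 = not AI-related.

    Hypothetical helper for illustration only.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return BinaryMetrics(accuracy, precision, recall, f1)


# Toy labels, not data from the paper.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
```

Precision matters most when the goal is a clean dataset (few non-AI papers slipping in), while recall matters most for coverage; F1 balances the two.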
Problem

Research questions and friction points this paper is trying to address.

Automatically classify AI-related documents from large-scale literature databases
Create an AI-related literature dataset named DeepDiveAI
Integrate expert knowledge with advanced models for accurate classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

LSTM model classifies AI-related records
Qwen2.5 Plus annotates coarse AI records
BERT classifier refines final AI dataset
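The three items above suggest a coarse-to-fine cascade: a first classifier produces a coarse AI set, an LLM annotates records from that set, and a second classifier refines the result. The sketch below shows one way such a cascade could be wired together. The exact data flow between the stages is an assumption, and every name here (`run_cascade`, `coarse_filter`, `annotate`, `train_refiner`) is hypothetical; the keyword rules are toy stand-ins for the LSTM, Qwen2.5 Plus, and BERT components, not the authors' implementation.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]               # hypothetical record with a "title" field
Classifier = Callable[[Record], bool]


def run_cascade(
    records: List[Record],
    coarse_filter: Classifier,                                        # stage 1: LSTM stand-in
    annotate: Classifier,                                             # stage 2: LLM annotator stand-in
    train_refiner: Callable[[List[Tuple[Record, bool]]], Classifier], # stage 3: BERT stand-in
) -> List[Record]:
    """Coarse filter, then annotate the coarse hits, then refine with a trained classifier."""
    coarse_hits = [r for r in records if coarse_filter(r)]
    labelled = [(r, annotate(r)) for r in coarse_hits]
    refiner = train_refiner(labelled)
    # For brevity the refiner is applied to the same records it was fitted on;
    # a real pipeline would annotate only a sample and refine the full coarse set.
    return [r for r in coarse_hits if refiner(r)]


def toy_coarse(record: Record) -> bool:
    title = record["title"].lower()
    return "learning" in title or "neural" in title


def toy_annotate(record: Record) -> bool:
    title = record["title"].lower()
    return "neural" in title or "machine learning" in title


def toy_train_refiner(labelled: List[Tuple[Record, bool]]) -> Classifier:
    # Toy "training": keep words seen only in positively labelled titles.
    pos_words, neg_words = set(), set()
    for record, is_ai in labelled:
        words = set(record["title"].lower().split())
        (pos_words if is_ai else neg_words).update(words)
    positive_only = pos_words - neg_words
    return lambda record: any(w in positive_only for w in record["title"].lower().split())


toy_records = [
    {"title": "Neural network training"},
    {"title": "Machine learning for crops"},
    {"title": "Learning outcomes in schools"},
    {"title": "Soil chemistry survey"},
]
final_set = run_cascade(toy_records, toy_coarse, toy_annotate, toy_train_refiner)
final_titles = [r["title"] for r in final_set]
```

Under these assumptions, the inexpensive first stage is tuned for recall over the whole corpus, so the costlier annotation and refinement steps only see the much smaller coarse set.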
Xiaochen Zhou
The University of Hong Kong, Shanghai Artificial Intelligence Laboratory
Xingzhou Liang
Shanghai Artificial Intelligence Laboratory
Zou Hui
Shanghai University
Lu Yi
Renmin University of China
Graph algorithm, Graph Neural Network, Dynamic graph
Jingjing Qu
Shanghai Artificial Intelligence Laboratory