DeepDiveAI: Identifying AI Related Documents in Large Scale Literature Data

📅 2024-08-23

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Accurately identifying AI-related scholarly publications across disciplines and over an extended period (1956–2024) remains challenging because of heterogeneous terminology, evolving definitions, and sparse domain-specific labeled data. Method: We introduce the first dedicated annotation dataset for AI literature identification and propose an LSTM-based binary classifier that jointly encodes metadata (e.g., venue, year, authors) and textual features from titles and abstracts. The model is explicitly designed for temporal robustness and cross-domain generalizability. Contribution/Results: Our approach enables scalable, high-precision filtering of AI literature, yielding DeepDiveAI, the largest publicly available AI literature dataset to date, comprising 9.4 million papers. Rigorous evaluation demonstrates accuracy and F1-score above 96%. DeepDiveAI provides a foundational, high-quality resource for AI historiography, disciplinary evolution analysis, and evidence-based science policy research.

📝 Abstract

This paper presents DeepDiveAI, a comprehensive dataset specifically curated to identify AI-related research papers from a large-scale academic literature database. The dataset was created using an advanced Long Short-Term Memory (LSTM) model trained on a binary classification task to distinguish between AI-related and non-AI-related papers. The model was trained and validated on a vast dataset, achieving high accuracy, precision, recall, and F1-score. The resulting DeepDiveAI dataset comprises over 9.4 million AI-related papers published since the Dartmouth Conference, from 1956 to 2024, providing a crucial resource for analyzing trends, thematic developments, and the evolution of AI research across various disciplines.
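
The abstract reports accuracy, precision, recall, and F1-score for a binary AI / non-AI classifier. As a reminder of how those four numbers relate, here is a minimal sketch that derives them from confusion-matrix counts. The function name `binary_metrics` and the toy labels are illustrative assumptions, not taken from the paper, and the paper's own evaluation code may differ.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for 0/1 labels.

    Label 1 stands for "AI-related" and 0 for "not AI-related"
    (an assumed encoding for this illustration).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of the papers flagged as AI, how many really are.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the true AI papers, how many were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels, invented purely for illustration.
    truth = [1, 1, 1, 0, 0, 0, 1, 0]
    preds = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(truth, preds))
```

On a corpus where AI papers are a minority, accuracy alone can look high even when many AI papers are missed, which is why precision, recall, and F1 are reported alongside it.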

Problem

Research questions and friction points this paper is trying to address.

Automatically classify AI-related documents from large-scale literature databases

Create an AI-related literature dataset named DeepDiveAI

Integrate expert knowledge with advanced models for accurate classification

Innovation

Methods, ideas, or system contributions that make the work stand out.

LSTM model classifies AI-related records

Qwen2.5 Plus annotates coarse AI records

BERT classifier refines final AI dataset
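
The three items above read like a staged cascade: a coarse classifier, an annotation step, and a refining classifier. The sketch below shows one plausible way to wire such a cascade together. It is an assumption-laden illustration only: `run_cascade`, the `Record` type, and the keyword-based stand-ins are hypothetical, no real LSTM, Qwen2.5 Plus, or BERT model is called, and the summary does not say exactly how the annotations feed the final classifier.

```python
from typing import Callable, Dict, List

# Hypothetical record type: a paper represented by its text fields.
Record = Dict[str, str]


def run_cascade(
    records: List[Record],
    coarse: Callable[[Record], bool],
    annotate: Callable[[Record], bool],
    refine: Callable[[Record, bool], bool],
) -> List[Record]:
    """Apply a three-stage filter and return the records that survive."""
    # Stage 1: a cheap, high-recall classifier keeps candidate AI records
    # (the role the summary assigns to the LSTM model).
    candidates = [r for r in records if coarse(r)]

    # Stage 2: an annotator attaches a provisional label to each candidate
    # (the role the summary assigns to Qwen2.5 Plus).
    annotated = [(r, annotate(r)) for r in candidates]

    # Stage 3: a refining classifier makes the final keep/drop decision
    # (the role the summary assigns to the BERT classifier).
    return [r for r, label in annotated if refine(r, label)]


# Toy stand-ins so the sketch runs without any model weights.
def toy_coarse(record: Record) -> bool:
    title = record["title"].lower()
    return "learning" in title or "neural" in title


def toy_annotate(record: Record) -> bool:
    return "neural" in record["title"].lower()


def toy_refine(record: Record, label: bool) -> bool:
    # Here the refiner simply trusts the provisional label.
    return label


if __name__ == "__main__":
    papers = [
        {"title": "Neural networks for vision"},
        {"title": "Learning outcomes in primary schools"},
        {"title": "Soil chemistry survey"},
    ]
    kept = run_cascade(papers, toy_coarse, toy_annotate, toy_refine)
    print([p["title"] for p in kept])
```

Passing each stage in as a callable keeps the cascade independent of any particular model, so a trained classifier could replace a toy stand-in without changing `run_cascade`.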
