Leveraging Large Language Models for Automated Scalable Development of Open Scientific Databases

📅 2026-03-07
🤖 AI Summary
This work proposes the first unified, scalable, and domain-agnostic automated framework to address the time-consuming and error-prone nature of manual collection and curation of scientific literature. By integrating keyword-based retrieval, parallel harvesting from multiple API sources, and domain-specific filtering via customized large language model (LLM) prompts, the framework efficiently constructs open scientific databases. Evaluated on agricultural and crop yield tasks, the generated database achieves a 90% overlap with a small expert-curated benchmark, demonstrating high fidelity while substantially reducing manual effort. These results validate the framework's effectiveness and practical utility for accelerating open science initiatives across diverse research domains.
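
The parallel harvesting step described in this summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two source functions are hypothetical stand-ins that return canned records instead of calling a real API, the DOIs are made up, and de-duplicating on a lower-cased DOI is an assumption about how results from several sources might be merged.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Record = Dict[str, str]
Source = Callable[[str], List[Record]]


def source_a(query: str) -> List[Record]:
    # Hypothetical stand-in for a real literature API client (canned data).
    return [
        {"doi": "10.0000/a1", "title": "Maize yield under drought"},
        {"doi": "10.0000/shared", "title": "Nitrogen and wheat yield"},
    ]


def source_b(query: str) -> List[Record]:
    # Second hypothetical source; one record duplicates source_a's.
    return [
        {"doi": "10.0000/SHARED", "title": "Nitrogen and wheat yield"},
        {"doi": "10.0000/b1", "title": "Soybean yield forecasting"},
    ]


def harvest(query: str, sources: List[Source]) -> List[Record]:
    """Query every source in parallel and merge the results.

    Records are de-duplicated on a lower-cased DOI (an assumption);
    the first occurrence of each DOI is kept.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        # pool.map returns batches in the same order as `sources`.
        batches = list(pool.map(lambda fetch: fetch(query), sources))

    merged: Dict[str, Record] = {}
    for batch in batches:
        for record in batch:
            merged.setdefault(record["doi"].lower(), record)
    return list(merged.values())


records = harvest("crop yield", [source_a, source_b])
print(len(records))  # 3: the shared record is kept only once
```

Threads suit this step because each request would spend most of its time waiting on the network; a real client would also need timeouts, retries, and rate limiting, which are omitted here.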

📝 Abstract
With the exponential increase in online scientific literature, identifying reliable domain-specific data has become increasingly important but also very challenging. Manual collection and filtering of domain-specific scientific literature is time-consuming, labor-intensive, and prone to errors and inconsistencies. To facilitate automated data collection, the paper introduces a web-based tool that leverages Large Language Models (LLMs) for automated and scalable development of open scientific databases. More specifically, the tool is based on an automated and unified framework that combines keyword-based querying, API-enabled data retrieval, and LLM-powered text classification to construct domain-specific scientific databases. Data is collected from multiple reliable data sources and search engines using a parallel querying technique to construct a combined unified dataset. The dataset is subsequently filtered using LLMs queried with prompts tailored to each keyword-based query, extracting the data relevant to the scientific query of interest. The approach was tested across a set of variable keyword-based searches for different domain-specific tasks related to agriculture and crop yield. The results and analysis show 90% overlap with small domain expert-curated databases, suggesting that the proposed tool can significantly reduce manual workload. Furthermore, the proposed framework is both scalable and domain-agnostic and can be applied across diverse fields for building open scientific databases.
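
The filtering and evaluation steps in the abstract can be sketched roughly as follows. Everything here is illustrative and not taken from the paper: the prompt wording, the record fields, and the `keyword_stub_llm` function (an offline stand-in for a real LLM call) are hypothetical. The abstract does not define how overlap is computed, so the sketch assumes it is the fraction of expert-curated items that also appear in the filtered set.

```python
from typing import Callable, Dict, List, Set

Record = Dict[str, str]


def build_prompt(query: str, record: Record) -> str:
    """Build a yes/no relevance prompt tailored to one keyword query."""
    return (
        f"Research question: {query}\n"
        f"Title: {record['title']}\n"
        "Is this paper relevant to the research question? Answer YES or NO."
    )


def keyword_stub_llm(prompt: str) -> str:
    # Offline stand-in for a real LLM call (hypothetical): answers YES
    # when the title line of the prompt mentions "yield".
    title_line = prompt.splitlines()[1].lower()
    return "YES" if "yield" in title_line else "NO"


def filter_records(
    query: str, records: List[Record], llm: Callable[[str], str]
) -> List[Record]:
    """Keep the records the classifier labels as relevant to the query."""
    return [
        r for r in records
        if llm(build_prompt(query, r)).strip().upper().startswith("YES")
    ]


def overlap(candidate_ids: Set[str], expert_ids: Set[str]) -> float:
    """Fraction of expert-curated items also present in the candidate set
    (an assumed definition of overlap)."""
    if not expert_ids:
        return 0.0
    return len(candidate_ids & expert_ids) / len(expert_ids)


harvested = [
    {"id": "p1", "title": "Maize yield under drought"},
    {"id": "p2", "title": "Soil microbiome survey"},
    {"id": "p3", "title": "Wheat yield and nitrogen"},
]
kept = filter_records("crop yield", harvested, keyword_stub_llm)
score = overlap({r["id"] for r in kept}, {"p1", "p3", "p4", "p5"})
print([r["id"] for r in kept], score)  # ['p1', 'p3'] 0.5
```

Passing the classifier in as a function keeps the prompt construction and the scoring independent of any particular LLM provider, so the stub can be swapped for a real client without touching the rest of the sketch.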
Problem

Research questions and friction points this paper is trying to address.

scientific literature
domain-specific data
automated data collection
open scientific databases
data extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Automated Database Construction
Domain-Specific Data Filtering
Scalable Scientific Databases
Prompt-Based Text Classification
Authors

Nikita Gautam
Department of Computer Science, Kansas State University, Manhattan, KS 66502, USA
Doina Caragea
Kansas State University
deep learning, text mining, data mining, data science
Ignacio Ciampitti
Department of Agronomy, Purdue University, West Lafayette, IN 47907, USA
Federico Gomez
Department of Agronomy, Purdue University, West Lafayette, IN 47907, USA