ScienceDB AI: An LLM-Driven Agentic Recommender System for Large-Scale Scientific Data Sharing Services

📅 2026-01-03
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Traditional recommender systems struggle to accurately interpret researchers’ data needs within complex scientific contexts, hindering efficient sharing and reuse of scientific data. This work proposes a conversational agent-based recommendation system powered by large language models, which achieves deep understanding of scientific intent and dynamic data recommendation through a Scientific Intention Perceptor, a Structured Memory Compressor, and a Trustworthy Retrieval-Augmented Generation (Trustworthy RAG) framework. The system introduces the CSTR (Citable Scientific Task Record) identifier mechanism to ensure that recommended results are citable and reproducible. Experiments on over 10 million real-world scientific datasets demonstrate significant improvements in recommendation accuracy and user satisfaction. The system has been deployed as a publicly accessible service.

📝 Abstract
The rapid growth of AI for Science (AI4S) has underscored the significance of scientific datasets, leading to the establishment of numerous national scientific data centers and sharing platforms. Despite this progress, efficiently promoting dataset sharing and utilization for scientific research remains challenging. Scientific datasets contain intricate domain-specific knowledge and contexts, rendering traditional collaborative filtering-based recommenders inadequate. Recent advances in Large Language Models (LLMs) offer unprecedented opportunities to build conversational agents capable of deep semantic understanding and personalized recommendations. In response, we present ScienceDB AI, a novel LLM-driven agentic recommender system developed on Science Data Bank (ScienceDB), one of the largest global scientific data-sharing platforms. ScienceDB AI leverages natural language conversations and deep reasoning to accurately recommend datasets aligned with researchers' scientific intents and evolving requirements. The system introduces several innovations: a Scientific Intention Perceptor to extract structured experimental elements from complicated queries, a Structured Memory Compressor to manage multi-turn dialogues effectively, and a Trustworthy Retrieval-Augmented Generation (Trustworthy RAG) framework. The Trustworthy RAG employs a two-stage retrieval mechanism and provides citable dataset references via Citable Scientific Task Record (CSTR) identifiers, enhancing recommendation trustworthiness and reproducibility. Through extensive offline and online experiments using over 10 million real-world datasets, ScienceDB AI has demonstrated significant effectiveness. To our knowledge, ScienceDB AI is the first LLM-driven conversational recommender tailored explicitly for large-scale scientific dataset sharing services. The platform is publicly accessible at: https://ai.scidb.cn/en.
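The two-stage retrieval with citable references described in the abstract can be illustrated with a minimal sketch. The abstract does not say what the two stages are, so the split assumed here (a cheap token-overlap recall followed by a Jaccard rerank on titles) is purely illustrative and is not the authors' method. The names `Dataset`, `coarse_recall`, `rerank`, and `recommend`, and the `cstr:example-…` identifiers, are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    cstr: str         # placeholder identifier; real CSTR formats are not given in the source
    title: str
    description: str


def tokenize(text: str) -> set[str]:
    """Lowercase whitespace tokenization; a stand-in for real text processing."""
    return set(text.lower().split())


def coarse_recall(query: str, corpus: list[Dataset], limit: int) -> list[Dataset]:
    """Stage 1 (assumed): keep datasets sharing at least one token with the query."""
    q = tokenize(query)
    scored = [(len(q & tokenize(d.title + " " + d.description)), d) for d in corpus]
    hits = [(score, d) for score, d in scored if score > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in hits[:limit]]


def rerank(query: str, candidates: list[Dataset], top_k: int) -> list[Dataset]:
    """Stage 2 (assumed): order candidates by Jaccard similarity between query and title."""
    q = tokenize(query)

    def jaccard(d: Dataset) -> float:
        t = tokenize(d.title)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(candidates, key=jaccard, reverse=True)[:top_k]


def recommend(query: str, corpus: list[Dataset], limit: int = 50, top_k: int = 3) -> list[str]:
    """Run both stages and attach each dataset's identifier so the result is citable."""
    ranked = rerank(query, coarse_recall(query, corpus, limit), top_k)
    return [f"{d.title} [{d.cstr}]" for d in ranked]


# Toy corpus with made-up entries, for illustration only.
CORPUS = [
    Dataset("cstr:example-001", "Soil moisture observations", "daily soil moisture from field stations"),
    Dataset("cstr:example-002", "Protein structure images", "cryo-EM images of membrane proteins"),
    Dataset("cstr:example-003", "Global soil carbon map", "gridded soil organic carbon estimates"),
]

if __name__ == "__main__":
    for line in recommend("soil moisture data", CORPUS):
        print(line)
```

Every recommendation in this sketch carries its identifier through both stages, which mirrors the stated goal of citable, reproducible results. A production system would presumably replace both scoring functions with learned retrieval and reranking models.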
Problem

Research questions and friction points this paper is trying to address.

scientific data sharing
dataset recommendation
AI for Science
large-scale data services
personalized recommendation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-driven recommender
Scientific Intention Perceptor
Structured Memory Compressor
Trustworthy RAG
Citable Scientific Task Record
Qingqing Long
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
Haotian Chen
University of California, Los Angeles
Political Economy, Non-market Strategy, American Politics
Chenyang Zhao
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
Xiaolei Du
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
Xuezhi Wang
Research Scientist, Google DeepMind
Machine Learning, Natural Language Processing
Pengyao Wang
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
Chengzan Li
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
Yuanchun Zhou
Computer Network Information Center, CAS
Data Mining, Big Data Analysis
Hengshu Zhu
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China