SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery

📅 2026-02-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Scientific experimental data remain underutilized by general-purpose AI systems because of their high heterogeneity, domain specificity, and lack of semantic alignment, which impedes closed-loop scientific discovery. This work proposes the “AI-Ready Scientific Data” paradigm, extending the concept of AI-readiness from text to multimodal scientific data for the first time. It formally defines the specifications, structure, and compositional principles of such data and introduces SciDataCopilot, an autonomous agent framework that interprets scientific intent, fuses multimodal data, and automates data preparation end to end. Evaluations across three heterogeneous scientific domains show up to a 30-fold efficiency gain over manual workflows, along with improved reusability, transferability, and consistency, establishing a foundational data interface for AGI-driven scientific research.

📝 Abstract
The current landscape of AI for Science (AI4S) is predominantly anchored in large-scale textual corpora, where generative AI systems excel at hypothesis generation, literature search, and multi-modal reasoning. However, a critical bottleneck for accelerating closed-loop scientific discovery remains the utilization of raw experimental data. Characterized by extreme heterogeneity, high specificity, and deep domain expertise requirements, raw data possess neither direct semantic alignment with linguistic representations nor structural homogeneity suitable for a unified embedding space. This disconnect prevents the emerging class of Artificial General Intelligence for Science (AGI4S) from effectively interfacing with the physical reality of experimentation. In this work, we extend the text-centric AI-Ready concept to a Scientific AI-Ready data paradigm, explicitly formalizing how scientific data is specified, structured, and composed within a computational workflow. To operationalize this idea, we propose SciDataCopilot, an autonomous agentic framework designed to handle data ingestion, scientific intent parsing, and multi-modal integration in an end-to-end manner. By positioning data readiness as a core operational primitive, the framework provides a principled foundation for reusable, transferable systems, enabling the transition toward experiment-driven scientific general intelligence. Extensive evaluations across three heterogeneous scientific domains show that SciDataCopilot improves efficiency, scalability, and consistency over manual pipelines, with up to 30$\times$ speedup in data preparation.
Problem

Research questions and friction points this paper is trying to address.

raw experimental data
heterogeneity
semantic alignment
AGI4S
data readiness
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI-Ready data
agentic framework
scientific data preparation
Artificial General Intelligence for Science
data readiness
Jiyong Rao
Shanghai Artificial Intelligence Laboratory
Yicheng Qiu
Shanghai Artificial Intelligence Laboratory
Jiahui Zhang
Shanghai Artificial Intelligence Laboratory
Juntao Deng
Shanghai Artificial Intelligence Laboratory
Shangquan Sun
University of Chinese Academy of Sciences
Computer Vision, Machine Learning
Fenghua Ling
Shanghai Artificial Intelligence Laboratory
AI4Climate, Climate prediction, Weather prediction
Hao Chen
Shanghai AI Lab
AI4Earth, Multimodal, Remote Sensing
Nanqing Dong
Shanghai Artificial Intelligence Laboratory; University of Oxford
Machine Learning, Computer Vision, Optimization, AI for Science
Zhangyang Gao
Shanghai Artificial Intelligence Laboratory
Siqi Sun
Associate Professor; Fudan University, Shanghai AI Lab
deep learning, AI for Science
Yuqiang Li
Central South University
Internal Combustion Engine, Combustion, Emissions, Mechanism
Dongzhan Zhou
Researcher at Shanghai AI Lab
AI4Science, computer vision, deep learning
Guangyu Wang
Houston Methodist
Bioinformatics, Computational biology, AI, epigenetics
Lijun Wu
Shanghai AI Laboratory
ML, LLM, AI4Science
Conghui He
Shanghai AI Laboratory
Data-centric AI, LLM, Document Intelligence
Xuhong Wang
Shanghai Artificial Intelligence Laboratory
LLM, Knowledge System, AI Simulation
Jing Shao
Research Scientist, Shanghai AI Laboratory/Shanghai Jiao Tong University
Computer Vision, Multi-Modal Large Language Model
Xiang Liu
Shanghai Artificial Intelligence Laboratory
Yu Zhu
Shanghai Artificial Intelligence Laboratory
Mianxin Liu
Shanghai Artificial Intelligence Laboratory
Qihao Zheng
Shanghai AI Lab
Neuroscience, NeuroAI, AI4Neuro, AI4Science
Yinghui Zhang
XUPT & SMU
Public Key Cryptography, Cloud Security, Network Security
Jiamin Wu
The Chinese University of Hong Kong, Shanghai AI Lab
Computer Vision, Few-Shot learning, AI4Science
Xiaosong Wang
Shanghai AI Laboratory
Medical Image Analysis, Computer Vision, Vision and Language
Shixiang Tang
Shanghai Artificial Intelligence Laboratory