Automating Exploratory Multiomics Research via Language Models

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: In clinical proteogenomics, converting raw multi-omics data into reliable, novel biological hypotheses remains a major challenge due to the lack of automated, interpretable frameworks. Method: We propose PROTEUS, a fully automated hypothesis generation framework that models the scientific discovery process as an evolvable, interpretable research process graph. It integrates large language models, modular workflow simulation, graph neural network–based representation learning, and an automatic open-ended scoring mechanism to enable end-to-end analysis of heterogeneous high-throughput data. Contribution/Results: PROTEUS unifies exploratory analysis, statistical testing, and iterative hypothesis generation within a single graph structure, supporting open-ended autonomous discovery. Applied to 10 public clinical multi-omics datasets, it generated 360 hypotheses; external data validation and automatic open-ended scoring indicate that these hypotheses balance reliability and novelty. The work points toward tailoring general-purpose autonomous systems to specialized scientific domains.

📝 Abstract
This paper introduces PROTEUS, a fully automated system that produces data-driven hypotheses from raw data files. We apply PROTEUS to clinical proteogenomics, a field where effective downstream data analysis and hypothesis proposal are crucial for producing novel discoveries. PROTEUS uses separate modules to simulate different stages of the scientific process, from open-ended data exploration to specific statistical analysis and hypothesis proposal. It formulates research directions, tools, and results in terms of relationships between biological entities, using unified graph structures to manage complex research processes. We applied PROTEUS to 10 clinical multiomics datasets from published research, arriving at 360 total hypotheses. Results were evaluated through external data validation and automatic open-ended scoring. Through exploratory and iterative research, the system can navigate high-throughput and heterogeneous multiomics data to arrive at hypotheses that balance reliability and novelty. In addition to accelerating multiomics analysis, PROTEUS represents a path towards tailoring general autonomous systems to specialized scientific domains to achieve open-ended hypothesis generation from data.
Problem

Research questions and friction points this paper is trying to address.

Automating hypothesis generation from multiomics data
Enhancing clinical proteogenomics analysis efficiency
Balancing reliability and novelty in hypotheses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated system for data-driven hypothesis generation
Modular approach simulating scientific research stages
Unified graph structures manage complex biological relationships
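
The idea of recording research directions, results, and hypotheses as relationships between biological entities in one graph can be illustrated with a small sketch. This is not the PROTEUS implementation: the class names, the `stage` labels, and the placeholder entities (`GENE_A`, `PROTEIN_A`, `survival`) are all assumptions made for illustration, and no data from the paper is used.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """One statement linking two biological entities (hypothetical schema)."""
    source: str    # e.g. a gene or protein identifier
    target: str    # e.g. another molecule or a clinical variable
    relation: str  # free-text label for the relationship
    stage: str     # assumed labels: "direction", "result", "hypothesis"


@dataclass
class ResearchGraph:
    """Stores every stage of an analysis as edges in a single graph."""
    edges: list = field(default_factory=list)

    def add(self, source, target, relation, stage):
        rel = Relationship(source, target, relation, stage)
        self.edges.append(rel)
        return rel

    def by_stage(self, stage):
        # All relationships recorded at a given stage of the process.
        return [e for e in self.edges if e.stage == stage]

    def neighbors(self, entity):
        # Entities directly linked to `entity`, ignoring edge direction.
        linked = set()
        for e in self.edges:
            if e.source == entity:
                linked.add(e.target)
            elif e.target == entity:
                linked.add(e.source)
        return sorted(linked)


# Placeholder entities only; these are not findings from the paper.
graph = ResearchGraph()
graph.add("GENE_A", "PROTEIN_A", "encodes", "direction")
graph.add("PROTEIN_A", "survival", "abundance_associated_with", "result")
graph.add("GENE_A", "survival", "may_influence", "hypothesis")

hypotheses = graph.by_stage("hypothesis")
linked_to_gene = graph.neighbors("GENE_A")
```

Keeping exploratory directions, statistical results, and proposed hypotheses in the same edge list means a later step can look up everything already known about an entity before proposing something new, which is one plausible reading of how a unified graph helps manage an iterative research process.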
Shang Qu
Tsinghua University
AI4Bio
Ning Ding
Tsinghua University, Shanghai Artificial Intelligence Laboratory
Linhai Xie
National Center for Protein Science - Beijing
deep learning, reinforcement learning, robotics, proteomics
Yifei Li
Tsinghua University, National Center for Protein Sciences (Beijing), State Key Laboratory of Medical Proteomics, Beijing Proteome Research Center
Zaoqu Liu
National Center for Protein Sciences (Beijing), State Key Laboratory of Medical Proteomics, Beijing Proteome Research Center
Kaiyan Zhang
Tsinghua University
Foundation Model, Collective Intelligence, Scientific Intelligence
Yibai Xiong
Tsinghua University, Frontis AI
Yuxin Zuo
Tsinghua University
Zhangren Chen
Frontis AI
Ermo Hua
Tsinghua University
Physics-driven Foundation Model
Xingtai Lv
Tsinghua University
Large Language Model, Natural Language Processing
Youbang Sun
Assistant Researcher, Tsinghua University; Northeastern University; Texas A&M University
Distributed Optimization, Multi-Agent RL, Riemannian Optimization, Federated Learning
Yang Li
National Center for Protein Sciences (Beijing), State Key Laboratory of Medical Proteomics, Beijing Proteome Research Center
Dong Li
National Center for Protein Sciences (Beijing), State Key Laboratory of Medical Proteomics, Beijing Proteome Research Center
Fuchu He
National Center for Protein Sciences (Beijing), State Key Laboratory of Medical Proteomics, Beijing Proteome Research Center, International Academy of Phronesis Medicine (Guangdong)
Bowen Zhou
Tsinghua University, Shanghai Artificial Intelligence Laboratory