When AI Does Science: Evaluating the Autonomous AI Scientist KOSMOS in Radiation Biology

📅 2025-11-17
🤖 AI Summary
This study evaluates KOSMOS, an autonomous AI scientist, on its ability to generate and validate testable hypotheses in radiobiology. Method: three biologically grounded, falsifiable hypotheses were tested with an LLM-driven pipeline that combines literature-based hypothesis generation, statistical validation (Spearman correlation, empirical p-values, concordance index), and a negative-control benchmark built from random gene sets. Results: baseline CDO1 expression correlated positively with radiation-response module strength (r = 0.70, empirical p = 0.0039). A 12-gene expression signature showed moderate predictive performance for biochemical recurrence-free survival after radiotherapy plus androgen deprivation therapy in prostate cancer (C-index = 0.61, p = 0.017), although its effect size was not unique. The hypothesis that baseline DNA damage response capacity predicts the p53 response was not supported (Spearman rho = -0.40, p = 0.76). Contribution: the study shows that AI scientists can produce useful hypotheses in translational radiobiology, provided their output is audited against appropriate null models.

📝 Abstract
Agentic AI "scientists" now use language models to search the literature, run analyses, and generate hypotheses. We evaluate KOSMOS, an autonomous AI scientist, on three problems in radiation biology using simple random-gene null benchmarks. Hypothesis 1: baseline DNA damage response (DDR) capacity across cell lines predicts the p53 transcriptional response after irradiation (GSE30240). Hypothesis 2: baseline expression of OGT and CDO1 predicts the strength of repressed and induced radiation-response modules in breast cancer cells (GSE59732). Hypothesis 3: a 12-gene expression signature predicts biochemical recurrence-free survival after prostate radiotherapy plus androgen deprivation therapy (GSE116918). The DDR-p53 hypothesis was not supported: DDR score and p53 response were weakly negatively correlated (Spearman rho = -0.40, p = 0.76), indistinguishable from random five-gene scores. OGT showed only a weak association (r = 0.23, p = 0.34), whereas CDO1 was a clear outlier (r = 0.70, empirical p = 0.0039). The 12-gene signature achieved a concordance index of 0.61 (p = 0.017) but a non-unique effect size. Overall, KOSMOS produced one well-supported discovery, one plausible but uncertain result, and one false hypothesis, illustrating that AI scientists can generate useful ideas but require rigorous auditing against appropriate null models.
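
The random-gene null benchmark described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes expression data stored as a hypothetical dict mapping gene name to a list of per-sample values, uses Spearman correlation, compares absolute correlations (a two-sided test), and applies the common add-one correction so that the empirical p-value is never exactly zero. The function names are illustrative.

```python
import random
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def empirical_p(candidate, expression, response, n_draws=1000, seed=0):
    """Share of random genes correlating with `response` at least as strongly
    (in absolute value) as the candidate gene does.

    Assumed layout: expression[gene] is a list of per-sample values aligned
    with `response`.
    """
    rng = random.Random(seed)
    observed = abs(spearman(expression[candidate], response))
    pool = [g for g in expression if g != candidate]
    hits = 0
    for _ in range(n_draws):
        gene = rng.choice(pool)  # one random gene per draw
        if abs(spearman(expression[gene], response)) >= observed:
            hits += 1
    return (hits + 1) / (n_draws + 1)  # add-one correction
```

With such a layout, `empirical_p("CDO1", expression, response)` would estimate how often a randomly chosen gene matches or exceeds the candidate's correlation. For a multi-gene score, such as the five-gene DDR score, the same idea applies with random gene sets of matching size in place of single genes.
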

Problem

Research questions and friction points this paper is trying to address.

Evaluating autonomous AI scientist KOSMOS in radiation biology research
Testing AI-generated hypotheses about DNA damage response and gene expression
Assessing AI's ability to predict radiation therapy outcomes using biomarkers

Innovation

Methods, ideas, or system contributions that make the work stand out.

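Uses an autonomous, language-model-driven AI scientist to search the literature, run analyses, and generate hypotheses
Evaluates the resulting hypotheses on public gene expression datasets (GSE30240, GSE59732, GSE116918)
Tests each prediction against random-gene null benchmarks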
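
For a survival signature, the same null-benchmark idea can be sketched with a concordance index. The code below is a simplified illustration under stated assumptions, not the method used in the paper: it implements Harrell's C while skipping pairs tied in time, scores a signature as the mean expression of its genes (an assumed scoring rule), treats a higher score as higher risk, and draws random signatures of the same size as the null. All function names and the dict-of-lists data layout are hypothetical.

```python
import random


def concordance_index(time, event, risk):
    """Harrell's C: among comparable pairs, the fraction in which the subject
    with the higher risk score has the earlier observed event.

    Simplification: pairs with equal follow-up times are skipped.
    """
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # the earlier member of a pair must have an observed event
        for j in range(n):
            if time[i] < time[j]:  # i failed before j failed or was censored
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def signature_score(expression, genes):
    """Per-sample mean expression over `genes` (assumed scoring rule)."""
    n_samples = len(next(iter(expression.values())))
    return [sum(expression[g][s] for g in genes) / len(genes)
            for s in range(n_samples)]


def random_signature_null(expression, time, event, size, n_draws=200, seed=0):
    """C-index values for random gene signatures of the given size."""
    rng = random.Random(seed)
    all_genes = sorted(expression)
    null = []
    for _ in range(n_draws):
        genes = rng.sample(all_genes, size)
        null.append(concordance_index(time, event,
                                      signature_score(expression, genes)))
    return null
```

Comparing the observed C-index of a signature with the distribution returned by `random_signature_null` shows whether its effect size stands out from random gene sets, which is the question behind the abstract's remark that the 12-gene signature had a non-unique effect size.
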
Humza Nusrat
Henry Ford Health
Monte Carlo, AI, VR/AR, FLASH
Omar Nusrat
Toronto Metropolitan University, Toronto, Canada