Agentomics-ML: Autonomous Machine Learning Experimentation Agent for Genomic and Transcriptomic Data

📅 2025-06-05

🤖 AI Summary

Existing LLM-driven AutoML agents generalize poorly and often fail to complete runs on heterogeneous computational biology data such as genomic and transcriptomic datasets. This paper introduces Agentomics-ML, a fully autonomous agent that follows predefined steps of an ML experimentation process and interacts with the file system through Bash to complete each one, producing a classification model together with the files needed for reproducible training and inference. Once a model is produced, training and validation metrics feed a reflection step that identifies issues such as overfitting and writes verbal feedback for later iterations, suggesting changes to data representation, model architecture, and hyperparameters. On several established genomic and transcriptomic benchmark datasets, the system outperforms existing state-of-the-art agent-based methods in both generalization and success rate. Expert-built models still lead on most of the datasets, but Agentomics-ML narrows the gap for fully autonomous systems and reaches state-of-the-art performance on one benchmark dataset.

📝 Abstract

The adoption of machine learning (ML) and deep learning methods has revolutionized molecular medicine by driving breakthroughs in genomics, transcriptomics, drug discovery, and biological systems modeling. The increasing quantity, multimodality, and heterogeneity of biological datasets demand automated methods that can produce generalizable predictive models. Recent developments in large language model-based agents have shown promise for automating end-to-end ML experimentation on structured benchmarks. However, when applied to heterogeneous computational biology datasets, these methods struggle with generalization and success rates. Here, we introduce Agentomics-ML, a fully autonomous agent-based system designed to produce a classification model and the necessary files for reproducible training and inference. Our method follows predefined steps of an ML experimentation process, repeatedly interacting with the file system through Bash to complete individual steps. Once an ML model is produced, training and validation metrics provide scalar feedback to a reflection step to identify issues such as overfitting. This step then creates verbal feedback for future iterations, suggesting adjustments to steps such as data representation, model architecture, and hyperparameter choices. We have evaluated Agentomics-ML on several established genomic and transcriptomic benchmark datasets and show that it outperforms existing state-of-the-art agent-based methods in both generalization and success rates. While state-of-the-art models built by domain experts still lead in absolute performance on the majority of the computational biology datasets used in this work, Agentomics-ML narrows the gap for fully autonomous systems and achieves state-of-the-art performance on one of the used benchmark datasets. The code is available at https://github.com/BioGeMT/Agentomics-ML.
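The reflection step described in the abstract can be illustrated with a minimal sketch. The abstract only says that training and validation metrics are turned into verbal feedback, so everything below is an assumption made for illustration: the `Metrics` structure, the `reflect` function, the accuracy metric, and both thresholds are hypothetical, and a fixed rule stands in for whatever reasoning the actual agent performs.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Scalar feedback from one iteration (hypothetical structure)."""
    train_accuracy: float
    validation_accuracy: float


def reflect(metrics: Metrics, gap_threshold: float = 0.10,
            low_score: float = 0.60) -> str:
    """Turn scalar metrics into verbal feedback for the next iteration.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    gap = metrics.train_accuracy - metrics.validation_accuracy
    if gap > gap_threshold:
        # A large train/validation gap is the classic sign of overfitting.
        return (
            "Possible overfitting: training accuracy exceeds validation "
            f"accuracy by {gap:.2f}. Consider stronger regularization "
            "or a simpler model architecture."
        )
    if metrics.train_accuracy < low_score:
        # The model does not even fit the training data well.
        return (
            "Possible underfitting: training accuracy is low. Consider a "
            "different data representation or a larger model."
        )
    return "No obvious issue detected. Consider tuning hyperparameters."


if __name__ == "__main__":
    print(reflect(Metrics(train_accuracy=0.99, validation_accuracy=0.70)))
```

The returned string plays the role of the verbal feedback that would be passed to the next iteration alongside the task description.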

Problem

Research questions and friction points this paper is trying to address.

Automates ML experimentation for genomic and transcriptomic data

Addresses generalization challenges in computational biology datasets

Improves success rates of autonomous agent-based ML systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Autonomous agent for ML experimentation

Bash interaction for file system tasks

Reflection step with verbal feedback
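As a rough sketch of what Bash interaction for file system tasks might look like from the agent's side, the helper below runs one command through Bash and returns the exit code and output that an agent could inspect before choosing its next action. The function name `run_bash`, the returned dictionary layout, and the timeout default are assumptions made for illustration; the source does not describe how Agentomics-ML implements this.

```python
import subprocess


def run_bash(command: str, timeout_seconds: float = 60.0) -> dict:
    """Run one command through Bash and return what an agent would observe.

    Hypothetical helper: raises subprocess.TimeoutExpired if the command
    runs longer than timeout_seconds.
    """
    completed = subprocess.run(
        ["bash", "-c", command],
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode bytes to str
        timeout=timeout_seconds,
    )
    return {
        "exit_code": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


if __name__ == "__main__":
    # List the working directory, as an agent might before reading data files.
    observation = run_bash("ls -1")
    print(observation["exit_code"])
    print(observation["stdout"])
```

Returning the exit code and stderr along with stdout lets the caller distinguish a failed step from a successful one with empty output, which matters when the next action depends on the result.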

🔎 Similar Papers

Large Language Model Agent for Hyper-Parameter Optimization