AGP: A Novel Arabidopsis thaliana Genomics-Phenomics Dataset and its HyperGraph Baseline Benchmarking

📅 2025-08-19
🤖 AI Summary
Genotype-to-phenotype (G2P) mapping is hindered by data fragmentation and cross-modal heterogeneity: existing resources typically store genomic and phenotypic data separately, lacking multi-modal alignment from the same biological samples. To address this, we introduce AGP, the first integrated multi-modal dataset for *Arabidopsis thaliana*, comprising paired RNA-seq gene expression profiles and high-dimensional, heterogeneous phenotypic traits from identical accessions. We further propose a biologically inspired hypergraph neural network that explicitly models higher-order gene cooperativity and cross-modal associations. Compared to conventional regression and standard graph-based models, our approach achieves significant improvements in both phenotypic prediction accuracy and interpretable association discovery. This work establishes a standardized, multi-modal benchmark for G2P research and provides a scalable, biologically grounded modeling paradigm, advancing functional genomics and intelligent crop breeding.

📝 Abstract
Understanding which genes control which traits in an organism remains one of the central challenges in biology. Despite significant advances in data collection technology, our ability to map genes to traits is still limited. This genome-to-phenome (G2P) challenge spans several problem domains, including plant breeding, and requires models capable of reasoning over high-dimensional, heterogeneous, and biologically structured data. Currently, however, many datasets solely capture genetic information or solely capture phenotype information. Additionally, phenotype data is very heterogeneous, which many datasets do not fully capture. The critical drawback is that these datasets are not integrated, that is, they do not link with each other to describe the same biological specimens. This limits machine learning models' ability to be informed on the various aspects of these specimens, impacting the breadth of correlations learned, and therefore their ability to make more accurate predictions. To address this gap, we present the Arabidopsis Genomics-Phenomics (AGP) Dataset, a curated multi-modal dataset linking gene expression profiles with phenotypic trait measurements in Arabidopsis thaliana, a model organism in plant biology. AGP supports tasks such as phenotype prediction and interpretable graph learning. In addition, we benchmark conventional regression and explanatory baselines, including a biologically-informed hypergraph baseline, to validate gene-trait associations. To the best of our knowledge, this is the first dataset that provides multi-modal gene information and heterogeneous trait or phenotype data for the same Arabidopsis thaliana specimens. With AGP, we aim to foster the research community towards accurately understanding the connection between genotypes and phenotypes using gene information, higher-order gene pairings, and trait data from several sources.

Problem

Research questions and friction points this paper is trying to address.

Mapping genes to traits remains a central challenge in biology
Existing datasets lack integrated genetic and phenotypic information
Models need to handle high-dimensional heterogeneous biological data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrated genomics-phenomics dataset for Arabidopsis thaliana
Hypergraph baseline for gene-trait association benchmarking
Multi-modal data linking gene expression with phenotypes
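
To make the idea of a hypergraph over genes concrete, here is a minimal sketch in plain Python. It is not the paper's baseline: the gene groupings, the function names (`incidence_matrix`, `hypergraph_operator`, `propagate`), and the choice of a symmetric degree-normalized operator are all assumptions made for illustration. Each hyperedge groups any number of genes (for example, a pathway), which is what lets a hypergraph express higher-order gene relationships that ordinary pairwise edges cannot.

```python
from math import sqrt


def incidence_matrix(num_genes, gene_sets):
    """Return H (genes x hyperedges) with H[g][e] = 1.0 if gene g is in set e."""
    H = [[0.0] * len(gene_sets) for _ in range(num_genes)]
    for e, members in enumerate(gene_sets):
        for g in members:
            H[g][e] = 1.0
    return H


def hypergraph_operator(H):
    """Return A = Dv^{-1/2} H De^{-1} H^T Dv^{-1/2} with unit hyperedge weights.

    Genes that belong to no hyperedge get an all-zero row and column.
    """
    n = len(H)
    m = len(H[0]) if n else 0
    dv = [sum(row) for row in H]                               # gene degrees
    de = [sum(H[g][e] for g in range(n)) for e in range(m)]    # hyperedge sizes
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if dv[i] == 0 or dv[j] == 0:
                continue
            shared = sum(H[i][e] * H[j][e] / de[e] for e in range(m) if de[e] > 0)
            A[i][j] = shared / sqrt(dv[i] * dv[j])
    return A


def propagate(A, X):
    """Multiply A (n x n) by a feature matrix X (n x f), e.g. expression values."""
    n, f = len(X), len(X[0])
    return [[sum(A[i][k] * X[k][c] for k in range(n)) for c in range(f)]
            for i in range(len(A))]


# Hypothetical toy data: 5 genes, two gene groups sharing gene 2; gene 4 is isolated.
gene_sets = [{0, 1, 2}, {2, 3}]
H = incidence_matrix(5, gene_sets)
A = hypergraph_operator(H)

# One made-up expression feature per gene; only gene 0 is "expressed".
X = [[1.0], [0.0], [0.0], [0.0], [0.0]]
smoothed = propagate(A, X)
```

In this toy example, gene 0's signal spreads only to genes that share a hyperedge with it (genes 1 and 2), while gene 3 and the isolated gene 4 receive nothing. A learned model would typically add trainable weights and a nonlinearity around this propagation step.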
Manuel Serna-Aguilera
Department of Electrical Engineering and Computer Science, University of Arkansas, Fayetteville
Fiona L. Goggin
Department of Entomology and Plant Pathology, University of Arkansas, Fayetteville
Aranyak Goswami
Department of Animal Science, University of Arkansas, Fayetteville
Alexander Bucksch
University of Arizona, School of Plant Science
Computational Plant Science, morphological plant modelling, plant shape, plant simulation, plant imaging
Suxing Liu
Georgia State University
Computer vision, machine learning, computational plant science, 3D imaging and reconstruction
Khoa Luu
EECS Department, University of Arkansas
Smart Health, Biometrics, Autonomous Driving, Quantum Machine Learning, Precision Agriculture