Genome-Factory: An Integrated Library for Tuning, Deploying, and Interpreting Genomic Models

📅 2025-09-12
🤖 AI Summary
Genomic AI model development suffers from fragmented, non-standardized workflows, which hinders reproducibility and biological interpretability. To address this, the authors introduce Genome-Factory, a unified, end-to-end Python library covering data acquisition (automated sequence downloading, preprocessing, and quality control), model fine-tuning (full-parameter, LoRA, and adapter-based), inference, benchmarking, and biological interpretation. Its key contribution is what the authors describe as the first open-source biological interpreter based on a sparse autoencoder, which disentangles model embeddings into sparse latent units and links them to interpretable genomic features. Genome-Factory also provides a zero-code CLI and a web interface for broad accessibility. The authors validate it along three dimensions: compatibility with diverse genomic models and fine-tuning methods, downstream benchmarking on two open-source benchmarks, and biological interpretation of representations learned by DNABERT-2, including GC-content bias and promoter recognition.

📝 Abstract
We introduce Genome-Factory, an integrated Python library for tuning, deploying, and interpreting genomic models. Our core contribution is to simplify and unify the workflow for genomic model development: data collection, model tuning, inference, benchmarking, and interpretability. For data collection, Genome-Factory offers an automated pipeline to download genomic sequences and preprocess them. It also includes quality control, such as GC content normalization. For model tuning, Genome-Factory supports three approaches: full-parameter, low-rank adaptation, and adapter-based fine-tuning. It is compatible with a wide range of genomic models. For inference, Genome-Factory enables both embedding extraction and DNA sequence generation. For benchmarking, we include two existing benchmarks and provide a flexible interface for users to incorporate additional benchmarks. For interpretability, Genome-Factory introduces the first open-source biological interpreter based on a sparse auto-encoder. This module disentangles embeddings into sparse, near-monosemantic latent units and links them to interpretable genomic features by regressing on external readouts. To improve accessibility, Genome-Factory features both a zero-code command-line interface and a user-friendly web interface. We validate the utility of Genome-Factory across three dimensions: (i) Compatibility with diverse models and fine-tuning methods; (ii) Benchmarking downstream performance using two open-source benchmarks; (iii) Biological interpretation of learned representations with DNABERT-2. These results highlight its end-to-end usability and practical value for real-world genomic analysis.
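The abstract names GC content normalization as a quality-control step but does not say how it is implemented. As a minimal sketch only, the following uses a plain GC-fraction filter as a simpler stand-in for normalization. The function names `gc_content` and `filter_by_gc` and the thresholds are hypothetical and are not Genome-Factory's API.

```python
def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    total = gc + sum(seq.count(base) for base in "AT")
    # Ambiguous symbols such as N are ignored; empty input yields 0.0.
    return gc / total if total else 0.0


def filter_by_gc(seqs, low=0.3, high=0.7):
    """Keep sequences whose GC fraction lies in [low, high] (illustrative thresholds)."""
    return [s for s in seqs if low <= gc_content(s) <= high]


# Hypothetical toy sequences, not data from the paper.
sequences = ["ATGCGC", "ATATAT", "GGGCCC", "ATGCNN"]
kept = filter_by_gc(sequences)
```

A filter like this discards extreme-GC sequences; a true normalization step might instead resample or reweight sequences so that classes share a similar GC distribution.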
Problem

Research questions and friction points this paper is trying to address.

Unifying the fragmented workflow for genomic model development
Automating the collection and preprocessing of genomic sequences
Making learned representations interpretable by linking them to biological features
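The feature-linking idea can be illustrated with a small regression. The abstract says latent units are linked to genomic features "by regressing on external readouts" but gives no further detail, so the sketch below makes an assumption: it fits a one-variable least-squares line from a single latent unit's activation to a readout and reports R². The helper `fit_line` and all numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept.

    Returns (slope, intercept, r2). Assumes x is not constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R^2 equals the squared Pearson correlation.
    r2 = (sxy * sxy) / (sxx * syy) if syy else 0.0
    return slope, intercept, r2


# Hypothetical activations of one sparse latent unit on five sequences.
latent = [0.0, 0.1, 0.4, 0.7, 0.9]
# Hypothetical external readout for the same sequences (here, GC fraction).
gc_fraction = [0.30, 0.34, 0.46, 0.58, 0.66]

slope, intercept, r2 = fit_line(latent, gc_fraction)
```

A latent unit with a high R² against a readout is a candidate for carrying that feature; in practice the fit would run over many units and many readouts, with held-out data to guard against spurious matches.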
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated genomic data collection and preprocessing pipeline
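Three fine-tuning approaches (full-parameter, low-rank adaptation, and adapter-based) compatible with a wide range of genomic models
First open-source biological interpreter based on a sparse auto-encoder, linking latent units to genomic features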
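To show why low-rank adaptation is attractive next to full-parameter tuning, the sketch below compares trainable parameter counts for one weight matrix. It relies on the standard LoRA formulation, in which a frozen weight of shape d_out × d_in is updated by the product of two trainable matrices of shapes d_out × r and r × d_in. The layer size and rank are illustrative assumptions, not settings reported for Genome-Factory.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is tuned."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights for LoRA: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


# Illustrative values only (assumed, not taken from the paper).
d_out = d_in = 768
rank = 8

full = full_params(d_out, d_in)        # 768 * 768
lora = lora_params(d_out, d_in, rank)  # 8 * (768 + 768)
fraction = lora / full                 # share of weights that remain trainable
```

With these assumed sizes, LoRA trains roughly 2% of the weights that full-parameter tuning would update for the same matrix, which is the usual motivation for offering it alongside full fine-tuning.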
Authors
Weimin Wu
Ph.D. Candidate in Computer Science, Northwestern University
AI for Biology, ML Theory
Xuefeng Song
Department of Computer Science, Northwestern University
AI for science, Large Language Model, Natural Language Processing
Yibo Wen
Center for Foundation Models and Generative AI, Northwestern University, USA
Qinjie Lin
PhD in computer science, Northwestern University
Robotics system, Robot Learning, Reinforcement Learning
Zhihan Zhou
Center for Foundation Models and Generative AI, Northwestern University, USA
Jerry Yao-Chieh Hu
Northwestern University
Machine Learning
(* denotes equal contribution)
Zhong Wang
Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, USA
Han Liu
Department of Statistics and Data Science, Northwestern University, USA