🤖 AI Summary
Genomic AI model development suffers from fragmented, non-standardized workflows, which hinders reproducibility and biological interpretability. To address this, we introduce Genome-Factory, a unified, end-to-end Python library covering data acquisition (automated sequence downloading, preprocessing, and quality control), model fine-tuning (full-parameter, LoRA, and adapter-based), inference, benchmarking, and biological interpretation. Its key contribution is the first open-source biological interpreter based on a sparse autoencoder, which disentangles model embeddings into sparse latent units and links them to interpretable genomic features. Genome-Factory also provides a zero-code command-line interface and a web interface for broad accessibility. The validation covers compatibility with diverse genomic models and fine-tuning methods, downstream evaluation on two open-source benchmarks, and biological interpretation of DNABERT-2 representations, including GC-content bias and promoter recognition.
📝 Abstract
We introduce Genome-Factory, an integrated Python library for tuning, deploying, and interpreting genomic models. Our core contribution is to simplify and unify the workflow for genomic model development: data collection, model tuning, inference, benchmarking, and interpretability. For data collection, Genome-Factory offers an automated pipeline to download genomic sequences and preprocess them. It also includes quality control, such as GC content normalization. For model tuning, Genome-Factory supports three approaches: full-parameter, low-rank adaptation, and adapter-based fine-tuning. It is compatible with a wide range of genomic models. For inference, Genome-Factory enables both embedding extraction and DNA sequence generation. For benchmarking, we include two existing benchmarks and provide a flexible interface for users to incorporate additional benchmarks. For interpretability, Genome-Factory introduces the first open-source biological interpreter based on a sparse auto-encoder. This module disentangles embeddings into sparse, near-monosemantic latent units and links them to interpretable genomic features by regressing on external readouts. To improve accessibility, Genome-Factory features both a zero-code command-line interface and a user-friendly web interface. We validate the utility of Genome-Factory across three dimensions: (i) Compatibility with diverse models and fine-tuning methods; (ii) Benchmarking downstream performance using two open-source benchmarks; (iii) Biological interpretation of learned representations with DNABERT-2. These results highlight its end-to-end usability and practical value for real-world genomic analysis.
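The interpretability idea described in the abstract (sparse latent units linked to genomic features by regression on an external readout) can be illustrated with a small NumPy sketch. This is not Genome-Factory's API or the authors' implementation: the function names `gc_content`, `toy_embedding`, and `train_sae` are hypothetical, the "embeddings" are simple base frequencies standing in for real DNABERT-2 embeddings, and the autoencoder is a one-layer ReLU model with an L1 penalty trained by plain gradient descent, chosen only for brevity.

```python
import numpy as np

def gc_content(seq):
    """Fraction of G and C among the A/C/G/T bases of a sequence."""
    seq = seq.upper()
    total = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / total if total else 0.0

def toy_embedding(seq):
    """Hypothetical stand-in for a model embedding: A/C/G/T frequencies."""
    seq = seq.upper()
    return np.array([seq.count(b) for b in "ACGT"], dtype=float) / max(len(seq), 1)

def train_sae(x, n_latent=16, l1=1e-3, lr=0.01, steps=1000, seed=0):
    """Fit a one-layer ReLU autoencoder with an L1 penalty by gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w_enc = rng.normal(0.0, 0.1, (d, n_latent))
    b_enc = np.zeros(n_latent)
    w_dec = rng.normal(0.0, 0.1, (n_latent, d))
    b_dec = np.zeros(d)
    losses = []
    for _ in range(steps):
        pre = x @ w_enc + b_enc
        z = np.maximum(pre, 0.0)              # sparse, non-negative latents
        err = z @ w_dec + b_dec - x           # reconstruction error
        # Loss: mean squared reconstruction error plus L1 penalty (z >= 0).
        losses.append((err ** 2).sum(axis=1).mean() + l1 * z.sum(axis=1).mean())
        # Backpropagate by hand; all gradients use the pre-update weights.
        g_out = 2.0 * err / n
        g_z = g_out @ w_dec.T + l1 * (z > 0) / n
        g_pre = g_z * (pre > 0)
        g_w_dec = z.T @ g_out
        w_dec -= lr * g_w_dec
        b_dec -= lr * g_out.sum(axis=0)
        w_enc -= lr * (x.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)
    return (lambda a: np.maximum(a @ w_enc + b_enc, 0.0)), losses

# Synthetic sequences with a per-sequence GC bias (illustrative data only).
rng = np.random.default_rng(0)
seqs = []
for _ in range(200):
    p = rng.uniform(0.2, 0.8)
    probs = [(1 - p) / 2, p / 2, p / 2, (1 - p) / 2]  # A, C, G, T
    seqs.append("".join(rng.choice(list("ACGT"), size=100, p=probs)))

x = np.stack([toy_embedding(s) for s in seqs])
x = (x - x.mean(axis=0)) / x.std(axis=0)      # standardize each dimension
y = np.array([gc_content(s) for s in seqs])   # external readout

encode, losses = train_sae(x)
latents = encode(x)

# Regress the readout on the latent units (ordinary least squares).
design = np.column_stack([np.ones(len(y)), latents])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ coef
r2 = 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Rank units by absolute covariance with the readout.
cov = ((latents - latents.mean(axis=0)) * (y - y.mean())[:, None]).mean(axis=0)
top_unit = int(np.argmax(np.abs(cov)))
print(f"final loss {losses[-1]:.4f}, R^2 {r2:.3f}, top unit {top_unit}")
```

In a real analysis, the toy embeddings would be replaced by embeddings extracted from a genomic model, and the readout could be any per-sequence annotation; the regression coefficients and the top-ranked unit then suggest which latent units track that feature.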