🤖 AI Summary
mRNA sequence design faces the challenge of jointly optimizing multiple objectives—stability, translational efficiency, and protein expression—within a high-dimensional combinatorial space. To address this, we propose a curriculum-learning-enhanced Generative Flow Network (GFlowNet) framework. First, we introduce a length-incremental curriculum strategy that enables progressive modeling from short to full-length sequences. Second, we construct the first GFlowNet-based biological environment tailored to mRNA design, supporting multi-objective Pareto optimization and out-of-distribution generalization. Third, we incorporate length-adaptive sampling and a target-protein-guided inverse generation mechanism. Experiments on multiple benchmarks demonstrate that our method significantly improves Pareto front quality, biophysical plausibility, and sequence diversity; converges faster; and generalizes well to unseen sequence patterns.
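The multi-objective Pareto optimization mentioned above rests on the standard notion of Pareto dominance: one candidate dominates another if it is at least as good on every objective and strictly better on at least one. A minimal sketch (illustrative only, not the paper's implementation; function names and objective ordering are assumptions):

```python
def dominates(a, b):
    """Return True if objective vector `a` Pareto-dominates `b`.

    Assumes all objectives (e.g. stability, translation efficiency,
    protein expression) are to be maximized.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Filter a list of objective vectors down to the non-dominated set."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]
```

For example, with candidates scored as (stability, expression), `pareto_front([(2, 3), (1, 3), (3, 1)])` keeps `(2, 3)` and `(3, 1)` but drops the dominated `(1, 3)`.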
📝 Abstract
Designing mRNA sequences is a major challenge in developing next-generation therapeutics, since it involves exploring a vast space of possible nucleotide combinations while optimizing sequence properties like stability, translation efficiency, and protein expression. While Generative Flow Networks are promising for this task, their training is hindered by sparse, long-horizon rewards and multi-objective trade-offs. We propose Curriculum-Augmented GFlowNets (CAGFN), which integrate curriculum learning with multi-objective GFlowNets to generate de novo mRNA sequences. CAGFN uses a length-based curriculum that progressively adapts the maximum sequence length, guiding exploration from easier to harder subproblems. We also provide a new mRNA design environment for GFlowNets which, given a target protein sequence and a combination of biological objectives, allows training models that generate plausible mRNA candidates. This provides a biologically motivated setting for applying and advancing GFlowNets in therapeutic sequence design. On different mRNA design tasks, CAGFN improves Pareto performance and biological plausibility while maintaining diversity. Moreover, CAGFN reaches higher-quality solutions faster than a GFlowNet trained with random sequence sampling (no curriculum), and generalizes to out-of-distribution sequences.
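The length-based curriculum described above can be sketched as a simple schedule that raises the maximum allowed sequence length as training progresses, so early stages solve short, easier subproblems before the model faces full-length generation. All function and parameter names below are illustrative assumptions, not details from the paper:

```python
def curriculum_max_length(step, start_len=30, full_len=300,
                          steps_per_stage=1000, growth=30):
    """Maximum sequence length permitted at a given training step.

    The curriculum advances one stage every `steps_per_stage` steps,
    growing the length cap by `growth` nucleotides per stage until it
    reaches the full target length. Values are placeholders.
    """
    stage = step // steps_per_stage
    return min(start_len + stage * growth, full_len)
```

In a training loop, each sampled trajectory would then be constrained (e.g. by masking the environment's action space) so the generated mRNA candidate never exceeds `curriculum_max_length(step)`; once the cap reaches `full_len`, training proceeds as in a standard GFlowNet.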