Predicting function of evolutionarily implausible DNA sequences

📅 2025-06-12
🤖 AI Summary
This study investigates whether genomic language models (gLMs) can predict the functional impact of evolutionarily implausible DNA sequences, specifically loss-of-function mutations created by translocating regulatory elements within synthetic expression cassettes. Method: The authors introduce Nullsettes, a benchmark that evaluates 12 state-of-the-art gLMs on mutation effect prediction under both regression and classification paradigms. Contribution/Results: Mutation effect prediction performance is strongly correlated with the likelihood the model assigns to the nonmutant (wild-type) sequence, but the likelihood range associated with strong performance depends heavily on sequence length, so a high likelihood alone does not guarantee reliable functional prediction. The results suggest that likelihood should be interpreted relative to sequence length, giving practitioners a more quantitative basis for deciding when to trust gLM-based mutation effect predictions in functional DNA design.

📝 Abstract
Genomic language models (gLMs) show potential for generating novel, functional DNA sequences for synthetic biology, but doing so requires them to learn not just evolutionary plausibility, but also sequence-to-function relationships. We introduce a set of prediction tasks called Nullsettes, which assesses a model's ability to predict loss-of-function mutations created by translocating key control elements in synthetic expression cassettes. Across 12 state-of-the-art models, we find that mutation effect prediction performance strongly correlates with the predicted likelihood of the nonmutant. Furthermore, the range of likelihood values predictive of strong model performance is highly dependent on sequence length. Our work highlights the importance of considering both sequence likelihood and sequence length when using gLMs for mutation effect prediction.
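
As a rough illustration of the kind of task the abstract describes, the sketch below builds a toy expression cassette, creates a mutant by translocating one control element, and scores the mutation as the log-likelihood difference between mutant and nonmutant. Everything here is an assumption made for illustration: the part sequences are placeholders, the specific move (promoter relocated to the end of the cassette) is hypothetical, and the first-order Markov scorer is only a stand-in for a real gLM, not the scoring used in the paper.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Part = Tuple[str, str]  # (element name, DNA sequence)

# Hypothetical cassette; these sequences are placeholders, not real elements.
CASSETTE: List[Part] = [
    ("promoter", "TTGACAGCTAGCTCAGTCCTAGGTATAAT"),
    ("cds", "ATGGCTAGCAAAGGAGAAGAACTTTTCTAA"),
    ("terminator", "GCCGCCAGTTCCGCTGGCGGCATTTT"),
]


def assemble(parts: List[Part]) -> str:
    """Concatenate the parts into one DNA string."""
    return "".join(seq for _, seq in parts)


def translocate(parts: List[Part], name: str, new_index: int) -> List[Part]:
    """Return a copy of `parts` with the element `name` moved to `new_index`."""
    moved = [p for p in parts if p[0] == name]
    if len(moved) != 1:
        raise ValueError(f"expected exactly one element named {name!r}")
    rest = [p for p in parts if p[0] != name]
    rest.insert(new_index, moved[0])
    return rest


def make_toy_scorer(training_seqs: List[str], order: int = 1) -> Callable[[str], float]:
    """Toy stand-in for a gLM: a Markov model with add-one smoothing."""
    counts: Dict[str, Counter] = {}
    for s in training_seqs:
        for i in range(order, len(s)):
            counts.setdefault(s[i - order:i], Counter())[s[i]] += 1

    def log_likelihood(seq: str) -> float:
        total = 0.0
        for i in range(order, len(seq)):
            ctx = counts.get(seq[i - order:i], Counter())
            # Add-one smoothing over the four nucleotides.
            p = (ctx[seq[i]] + 1) / (sum(ctx.values()) + 4)
            total += math.log(p)
        return total

    return log_likelihood


def mutation_effect_score(ll: Callable[[str], float], wild_type: str, mutant: str) -> float:
    """Log-likelihood of the mutant minus that of the nonmutant."""
    return ll(mutant) - ll(wild_type)


wild_type = assemble(CASSETTE)
mutant = assemble(translocate(CASSETTE, "promoter", 2))  # promoter moved to the end
ll = make_toy_scorer([wild_type])
score = mutation_effect_score(ll, wild_type, mutant)

# Report the nonmutant likelihood alongside sequence length, since the
# abstract stresses that both matter when interpreting the score.
print(f"length={len(wild_type)}  wild-type LL={ll(wild_type):.2f}  effect score={score:.2f}")
```

With a real gLM, `ll` would be replaced by that model's own sequence log-likelihood. Reporting the nonmutant likelihood together with the sequence length, rather than the effect score alone, reflects the abstract's point that both should be considered.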

Problem

Research questions and friction points this paper is trying to address.

Assessing gLMs' ability to predict loss-of-function mutations
Evaluating correlation between mutation effect prediction performance and the predicted likelihood of the nonmutant sequence
Exploring impact of sequence length on mutation prediction accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

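Benchmarks 12 state-of-the-art genomic language models on mutation effect prediction
Introduces Nullsettes, a set of tasks for predicting loss-of-function mutations created by translocating key control elements in synthetic expression cassettes
Shows that the likelihood range predictive of strong performance depends on sequence length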