Predicting function of evolutionarily implausible DNA sequences

📅 2025-06-12

🤖 AI Summary

This study examines how well genomic language models (gLMs) predict the functional impact of evolutionarily implausible DNA sequences, specifically loss-of-function mutations created by translocating regulatory elements within synthetic expression cassettes. Method: We introduce Nullsettes, a set of benchmark tasks, and use it to evaluate 12 state-of-the-art gLMs on mutation effect prediction under both regression and classification settings. Contribution/Results: Mutation effect prediction performance correlates strongly and positively with the likelihood the model assigns to the wild-type (nonmutant) sequence. The relationship depends heavily on sequence length: the range of likelihood values associated with strong performance shifts with length, so a high likelihood alone does not guarantee reliable functional prediction. These findings argue against relying on sequence likelihood alone as a proxy for function, and for considering likelihood and sequence length together when using gLMs in synthetic genomics.

📝 Abstract

Genomic language models (gLMs) show potential for generating novel, functional DNA sequences for synthetic biology, but doing so requires them to learn not just evolutionary plausibility, but also sequence-to-function relationships. We introduce a set of prediction tasks called Nullsettes, which assesses a model's ability to predict loss-of-function mutations created by translocating key control elements in synthetic expression cassettes. Across 12 state-of-the-art models, we find that mutation effect prediction performance strongly correlates with the predicted likelihood of the nonmutant. Furthermore, the range of likelihood values predictive of strong model performance is highly dependent on sequence length. Our work highlights the importance of considering both sequence likelihood and sequence length when using gLMs for mutation effect prediction.
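The task construction described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the element sequences are short made-up placeholders, `translocate` and `log_likelihood_ratio` are hypothetical helper names, and the first-order Markov scorer is only a stand-in for a real gLM. Scoring a mutant by its log-likelihood relative to the wild type is one common zero-shot convention; the abstract does not state that the paper uses exactly this score.

```python
import math
from collections import Counter

# Hypothetical cassette as ordered (name, sequence) pairs.
# The sequences are made-up placeholders, not real regulatory elements.
CASSETTE = [
    ("promoter", "TTGACAGCTAGCTCAGTCCTAGGTATAAT"),
    ("rbs", "AGGAGGTAAAT"),
    ("cds", "ATGGCTAGCAAAGGAGAAGAATAA"),
    ("terminator", "GCCGCCAGTTCCGCTGGCGGC"),
]


def assemble(elements):
    """Concatenate element sequences into one cassette string."""
    return "".join(seq for _, seq in elements)


def translocate(elements, name, new_index):
    """Return a copy of the cassette with the named element moved to new_index."""
    moved = [e for e in elements if e[0] == name]
    rest = [e for e in elements if e[0] != name]
    return rest[:new_index] + moved + rest[new_index:]


def markov_log_likelihood(seq, reference):
    """Toy stand-in for a gLM: first-order Markov log-likelihood of seq,
    with transition counts taken from reference and add-one smoothing
    over the four nucleotides."""
    pair_counts = Counter(zip(reference, reference[1:]))
    context_counts = Counter(reference[:-1])
    total = 0.0
    for a, b in zip(seq, seq[1:]):
        total += math.log((pair_counts[(a, b)] + 1) / (context_counts[a] + 4))
    return total


def log_likelihood_ratio(wild_type, mutant, score_fn):
    """Mutant score relative to the wild type; negative values mean the
    scorer finds the mutant less likely than the wild type."""
    return score_fn(mutant) - score_fn(wild_type)


wild_type = assemble(CASSETTE)
# Move the promoter behind the terminator, where it can no longer drive the CDS.
mutant = assemble(translocate(CASSETTE, "promoter", 3))


def score(seq):
    return markov_log_likelihood(seq, wild_type)


llr = log_likelihood_ratio(wild_type, mutant, score)
```

Because a translocation only reorders elements, the mutant has the same length and base composition as the wild type, so any difference in score comes from the changed arrangement. In practice `score` would be replaced by a call to an actual gLM.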

Problem

Research questions and friction points this paper is trying to address.

Assessing gLMs' ability to predict loss-of-function mutations

Evaluating how mutation effect prediction performance correlates with the likelihood of the nonmutant sequence

Exploring how sequence length changes the range of likelihood values that predicts strong performance
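The questions above amount to a length-stratified correlation analysis, which can be sketched as follows. This is a hedged illustration, not the paper's analysis code: the record layout `(length, wild_type_log_likelihood, performance)`, the function names, the bin size, and the minimum of three points per bin are all assumptions, and the example numbers are synthetic placeholders rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = shared
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def spearman_by_length(records, bin_size):
    """Group (length, wt_log_likelihood, performance) records into length bins
    and correlate likelihood with performance inside each bin.
    Bins with fewer than three records are skipped."""
    bins = defaultdict(list)
    for length, log_lik, perf in records:
        bins[length // bin_size].append((log_lik, perf))
    return {
        b * bin_size: spearman([p[0] for p in pts], [p[1] for p in pts])
        for b, pts in bins.items()
        if len(pts) >= 3
    }


# Synthetic placeholder records, NOT data from the paper.
records = [
    (900, -1.30, 0.55), (950, -1.25, 0.60), (980, -1.10, 0.80),
    (2100, -1.40, 0.50), (2300, -1.35, 0.52), (2400, -1.20, 0.75),
]
by_length = spearman_by_length(records, bin_size=1000)
```

Stratifying by length before correlating keeps sequences of very different lengths from being compared on a single likelihood scale, which is the concern the length-dependence question raises.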

Innovation

Methods, ideas, or system contributions that make the work stand out.

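Benchmarks 12 state-of-the-art genomic language models on mutation effect prediction

Introduces Nullsettes, a set of prediction tasks built from loss-of-function mutations created by translocating key control elements in synthetic expression cassettes

Shows that sequence likelihood and sequence length must be considered together when using gLMs for mutation effect prediction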
