In Search of Lost DNA Sequence Pretraining

📅 2026-04-17

🤖 AI Summary
This study addresses critical yet overlooked problems in current DNA pretraining research: inappropriate downstream evaluation datasets, flaws in the neighbor-masking strategy, and insufficient attention to vocabulary design. To address them, the authors propose selection criteria for evaluation datasets, guidelines for pretraining task design, and a systematic vocabulary analysis. They also introduce a standardized testbed for reproducible benchmarking of DNA pretraining methods. Extensive experiments, including large-scale pretraining, comparative analysis of masking strategies, vocabulary ablation studies, and cross-task assessments, support the significance of the identified problems and the rationale behind the recommendations, advancing the field toward more rigorous and standardized development of genomic foundation models.

📝 Abstract
DNA sequence encoding is fundamental to gene function prediction, protein synthesis, and diverse downstream biological tasks. Despite the substantial progress achieved by large-scale DNA sequence pretraining, existing studies have overwhelmingly emphasized pretraining scale and custom downstream evaluation datasets, while neglecting some essential components of the pretraining paradigm. In this paper, we reveal three critical yet previously overlooked problems in DNA pretraining: inappropriate downstream datasets, inherent flaws in the neighbor-masking strategy, and the lack of detailed discussion of vocabulary. We therefore undertake comprehensive investigations and propose principled guidelines, including selection criteria for evaluation datasets, guidance for task design, and an in-depth vocabulary analysis. Extensive experiments validate the significance of the problems we identify and support the rationale behind our recommendations. Finally, we introduce a standardized testbed that enables reproducible and rigorous benchmarking of DNA pretraining methods to advance the development of genomic foundation models.
Problem

Research questions and friction points this paper is trying to address.

DNA sequence pretraining
downstream datasets
neighbor-masking strategy
vocabulary analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

DNA sequence pretraining
evaluation benchmark
masking strategy
vocabulary design
genomic foundation models
Zhijiang Tang
Postgraduate student at University of Chinese Academy of Sciences
Deep Learning, AI for Science, Time Series Analysis
Jiaxin Qi
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China; Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Zhejiang, China
Yan Cui
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Zhejiang, China
Jinli Ou
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Zhejiang, China
Yuhua Zheng
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Zhejiang, China
Jianqiang Huang
Nanyang Technological University, Chinese Academy of Sciences
Computer Vision, Machine Learning, Causality