Motif Caller: Sequence Reconstruction for Motif-Based DNA Storage

📅 2024-12-20
🏛️ arXiv.org
🤖 AI Summary
Current nanopore-based basecalling methods for DNA data storage follow a two-stage paradigm: single-base recognition followed by a separate motif search. This leads to low motif identification accuracy, high latency, and substantial computational overhead. This work proposes the first end-to-end motif-level basecalling paradigm, which maps raw ionic current signals directly to predefined DNA motifs and bypasses the conventional base-level intermediate representation. The approach combines temporal deep neural networks, raw-signal modeling, and motif-level supervised training, and uses the structural priors of motifs to improve robustness. Experiments demonstrate that, while maintaining high throughput, the method improves motif recognition accuracy by 12.3%, reduces decoding latency by 47%, and decreases computational resource consumption by 39%. The result is an efficient and precise decoding framework tailored to high-density, scalable DNA data storage systems.

📝 Abstract
DNA data storage is rapidly gaining traction as a long-term data archival solution, primarily due to its exceptional durability. Retrieving stored data relies on DNA sequencing, which involves a process called basecalling: a typically costly and slow task that uses machine learning to map raw sequencing signals back to individual DNA bases, which are then translated into digital bits to recover the data. Current basecalling models have been optimized for reading individual bases. However, with the advent of novel DNA synthesis methods tailored for data storage, there is significant potential for optimizing the reading process. In this paper, we focus on motif-based DNA synthesis, where sequences are constructed from motifs (groups of bases) rather than individual bases. To enable efficient reading of data stored with motif-based DNA synthesis, we designed Motif Caller, a machine learning model built to detect entire motifs within a DNA sequence rather than individual bases. Motifs can also be detected by first identifying individual bases with a basecaller and then searching the basecalled sequence for motifs, but such an approach is unnecessarily complex and slow. A machine learning model that identifies motifs directly avoids the additional search step. It also makes use of the greater number of signal features available per motif, which allows motifs to be found with higher accuracy. Motif Caller significantly enhances the efficiency and accuracy of data retrieval in DNA storage based on motif-based DNA synthesis.
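
To make the two-stage baseline concrete, the following is a minimal sketch of its second stage: scanning an already basecalled read for known motifs. It is not the paper's implementation. The motif library `MOTIFS`, the function names, and the `max_mismatches` parameter are hypothetical, and the sketch assumes fixed-length motifs and substitution-only errors, whereas real nanopore reads also contain insertions and deletions.

```python
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def search_motifs(read: str, motifs: Dict[str, str], max_mismatches: int = 1) -> List[str]:
    """Scan a basecalled read left to right and report the motifs found.

    Assumption (not from the paper): all motifs share one length and
    basecalling errors are substitutions only.
    """
    k = len(next(iter(motifs.values())))
    found: List[str] = []
    i = 0
    while i + k <= len(read):
        window = read[i:i + k]
        # Pick the library motif closest to the current window.
        best_name, best_dist = min(
            ((name, hamming(window, seq)) for name, seq in motifs.items()),
            key=lambda pair: pair[1],
        )
        if best_dist <= max_mismatches:
            found.append(best_name)
            i += k  # jump past the matched motif
        else:
            i += 1  # no match here, slide by one base
    return found


# Hypothetical motif library and a read with one substitution in the second motif.
MOTIFS = {"M1": "ACGTAC", "M2": "TTGCAA", "M3": "GGATCC"}
read = "ACGTAC" + "TTGGAA" + "GGATCC"
print(search_motifs(read, MOTIFS))
```

Even in this simplified form, the search adds a full pass over every read after basecalling, and a single base-level error can cause a motif to be missed when the tolerance is too strict. That extra step is what direct motif detection removes.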

Problem

Research questions and friction points this paper is trying to address.

Reducing the cost and speed bottleneck of basecalling during data retrieval
Improving motif detection accuracy in DNA storage
Enhancing efficiency of motif-based data retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct motif detection from nanopore signals
Bypasses intermediate basecalling step
Leverages richer signal features for accuracy
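
As an illustration of what direct motif detection could look like at decoding time, the sketch below turns per-frame scores over a motif-level alphabet into a motif sequence. The summary above does not state how the model's outputs are decoded, so the CTC-style greedy decoding used here (take the best label per frame, merge repeats, drop blanks) is an assumption, and `greedy_motif_decode`, `LABELS`, and the example scores are hypothetical.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the "no motif" label


def greedy_motif_decode(frame_scores: Sequence[Sequence[float]], labels: List[str]) -> List[str]:
    """Greedy CTC-style decoding over motif labels (assumed scheme, not confirmed by the source).

    frame_scores[t][c] is the model's score for label c at signal frame t.
    """
    # Best-scoring label index for each frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]

    decoded: List[str] = []
    prev = BLANK
    for idx in best:
        # Emit a motif only when it is not blank and differs from the previous frame.
        if idx != BLANK and idx != prev:
            decoded.append(labels[idx])
        prev = idx
    return decoded


# Hypothetical label set: index 0 is blank, the rest are whole motifs.
LABELS = ["<blank>", "M1", "M2", "M3"]
frames = [
    [0.10, 0.80, 0.05, 0.05],  # M1
    [0.10, 0.70, 0.10, 0.10],  # M1 again, merged with the previous frame
    [0.90, 0.05, 0.03, 0.02],  # blank separates repeated motifs
    [0.20, 0.60, 0.10, 0.10],  # M1, a new occurrence
    [0.10, 0.10, 0.10, 0.70],  # M3
]
print(greedy_motif_decode(frames, LABELS))  # ['M1', 'M1', 'M3']
```

Because each output label is a whole motif, the decoded sequence can be used directly, with no base-level string or motif search in between.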