The Random Variables of the DNA Coverage Depth Problem

📅 2025-07-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the coverage depth problem in DNA-based random access storage, aiming to minimize the expected number of reads required to recover a target information strand under linear encoding. It computes the asymptotic performance of a recently proposed code construction, establishing and refining a conjecture in the field through two independent proofs, and it analyzes a geometric code construction based on balanced quasi-arcs, optimizing its parameters. It also studies the full probability distribution of the random variables behind coverage depth, not merely their expectation, which distinguishes code constructions that look identical when compared by the mean alone.

📝 Abstract

DNA data storage systems encode digital data into DNA strands, enabling dense and durable storage. Efficient data retrieval depends on coverage depth, a key performance metric. We study the random access coverage depth problem and focus on minimizing the expected number of reads needed to recover information strands encoded via a linear code. We compute the asymptotic performance of a recently proposed code construction, establishing and refining a conjecture in the field by giving two independent proofs. We also analyze a geometric code construction based on balanced quasi-arcs and optimize its parameters. Finally, we investigate the full distribution of the random variables that arise in the coverage depth problem, of which the traditionally studied expectation is just the first moment. This allows us to distinguish between code constructions that, at first glance, may appear to behave identically.
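To make the quantity in the abstract concrete, the following is a minimal Monte Carlo sketch, not the paper's method. It assumes a simplified model that the text above does not spell out: encoded strands are read uniformly at random with replacement, the code is binary, and an information strand counts as recovered once its unit vector lies in the span of the generator-matrix columns read so far. All function names (`in_span`, `add_to_basis`, `sample_read_counts`, `mean_and_variance`) are hypothetical helpers introduced for illustration.

```python
import random
import statistics


def in_span(basis, v):
    """Return True if bit-vector v lies in the GF(2) span stored in basis.

    basis maps a pivot bit position to a reduced vector with that leading bit.
    """
    while v:
        pivot = v.bit_length() - 1
        if pivot not in basis:
            return False
        v ^= basis[pivot]
    return True


def add_to_basis(basis, v):
    """Insert v into the GF(2) basis if it is linearly independent of it."""
    while v:
        pivot = v.bit_length() - 1
        if pivot not in basis:
            basis[pivot] = v
            return
        v ^= basis[pivot]


def sample_read_counts(columns, target, trials, rng):
    """Sample the number of reads needed until target is recoverable.

    columns: generator-matrix columns as integers (one per encoded strand).
    target:  unit vector of the wanted information strand, as an integer.
    Assumes target lies in the span of all columns; otherwise a trial
    would never terminate.
    """
    counts = []
    for _ in range(trials):
        basis, reads = {}, 0
        while not in_span(basis, target):
            add_to_basis(basis, rng.choice(columns))  # uniform read, with replacement
            reads += 1
        counts.append(reads)
    return counts


def mean_and_variance(counts):
    """First moment and (population) variance of the sampled read counts."""
    return statistics.fmean(counts), statistics.pvariance(counts)


if __name__ == "__main__":
    rng = random.Random(0)
    # Uncoded scheme with k = 3: identity generator matrix.
    identity = [0b001, 0b010, 0b100]
    # Toy single-parity code with k = 2, n = 3 (illustrative only).
    parity = [0b01, 0b10, 0b11]
    for name, cols in (("identity", identity), ("parity", parity)):
        counts = sample_read_counts(cols, 0b1, 20000, rng)
        print(name, mean_and_variance(counts))
```

Under these assumptions, the uncoded scheme yields a geometric read count with success probability 1/k, so its mean is k. Reporting the variance next to the mean reflects the abstract's point that the expectation is only the first moment of the underlying random variable.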

Problem

Research questions and friction points this paper is trying to address.

Minimize expected reads for DNA data retrieval

Analyze asymptotic performance of linear codes

Study full distribution of coverage depth variables

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing asymptotic performance of linear codes

Optimizing geometric code with balanced quasi-arcs

Studying full distribution of coverage depth variables

🔎 Similar Papers

On the Coverage Required for Diploid Genome Assembly