🤖 AI Summary
This study addresses the fundamental information-theoretic limits of complete diploid genome assembly, specifically establishing the first lower bound on the minimum required sequencing depth. It focuses on the critical bottleneck of resolving repetitive regions, particularly the challenge of “twin-repeat spanning.”
Method: We systematically compare greedy assembly and de Bruijn graph–based approaches in terms of their coverage requirements and read-length sensitivity for traversing twin-repeat structures. Using information-theoretic modeling, solvability analysis, and algorithmic simulations, we quantify the excess coverage—i.e., redundancy—of state-of-the-art assemblers relative to the theoretical lower bound.
Results: Our analysis reveals that current methods incur redundancy exceeding 100%—i.e., more than double the theoretically minimal coverage. Key contributions are: (1) the first information-theoretic lower bound for diploid genome assembly; (2) a rigorous demonstration that existing algorithms inherently introduce substantial redundancy due to unavoidable twin-repeat spanning; and (3) a principled theoretical benchmark guiding optimized sequencing strategies and next-generation assembler design.
📝 Abstract
We investigate the information-theoretic conditions to achieve the complete reconstruction of a diploid genome. We also analyze the standard greedy and de-Bruijn graph-based algorithms and compare the coverage depth and read length requirements with the information-theoretic lower bound. Our results show that the gap between the two is considerable because both algorithms require the double repeats in the genome to be bridged.