🤖 AI Summary
DNA sequences are inherently long, sparse, and noisy, which makes it hard for existing numerical representation methods to capture local structural patterns, generalize from limited samples, and retain biological interpretability all at once. To address this, we propose an interpretable DNA representation framework grounded in sparse recovery: high-frequency k-mers serve as semantically meaningful basis vectors, and fixed-length numerical encodings are generated by reconstructing each sequence through concatenation of those k-mers. This is presented as the first principled coupling between sparse recovery and DNA semantic structure. The resulting representations are both robust and biologically interpretable, enabling motif discovery and clustering without supervision. In promoter classification, our method achieves a 13% accuracy improvement over prior state-of-the-art approaches. It also improves clustering coherence and the resolution of functional elements, supporting both supervised and unsupervised downstream tasks.
📝 Abstract
DNA sequences encode vital genetic and biological information, yet these variable-length sequences cannot be fed directly to common data mining algorithms. Hence, various representation schemes have been developed to transform DNA sequences into fixed-length numerical representations. However, these schemes struggle to learn high-quality representations because DNA data is complex and sparse. Additionally, DNA sequences are inherently noisy because of mutations. Although several schemes are effective, they often lack semantic structure, making it difficult for biologists to validate and leverage the results. To address these challenges, we propose Dy-mer, an explainable and robust DNA representation scheme based on sparse recovery. Leveraging the underlying semantic structure of DNA, we modify traditional sparse recovery to capture recurring patterns indicative of biological functions by representing frequent k-mers as basis vectors and reconstructing each DNA sequence through simple concatenation. Experimental results demonstrate that Dy-mer achieves state-of-the-art performance in DNA promoter classification, yielding a remarkable 13% increase in accuracy. Moreover, its inherent explainability facilitates DNA clustering and motif detection, enhancing its utility in biological research.
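The abstract does not spell out the encoding procedure, so the following is only a minimal sketch of the general idea (frequent k-mers as a dictionary, each sequence reconstructed by concatenating dictionary entries), not the Dy-mer algorithm itself. The function names `build_dictionary` and `encode`, the greedy left-to-right matching, and the use of per-k-mer usage counts as the fixed-length vector are all assumptions made for illustration.

```python
from collections import Counter


def build_dictionary(seqs, k=3, top_n=4):
    """Return the top_n most frequent k-mers across all sequences (hypothetical helper)."""
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [kmer for kmer, _ in ranked[:top_n]]


def encode(seq, dictionary, k=3):
    """Greedily cover seq with dictionary k-mers, left to right (an assumed strategy).

    Returns a vector with one entry per dictionary k-mer, counting how often
    that k-mer was used in the concatenative cover. The vector length depends
    only on the dictionary size, not on the sequence length.
    """
    index = {kmer: j for j, kmer in enumerate(dictionary)}
    vec = [0] * len(dictionary)
    i = 0
    while i + k <= len(seq):
        j = index.get(seq[i:i + k])
        if j is None:
            i += 1          # no basis k-mer starts here; skip one base
        else:
            vec[j] += 1     # use this basis k-mer and jump past it
            i += k
    return vec


# Toy corpus (made-up sequences, for illustration only).
corpus = ["TATAAT", "TATATA", "GGTATA"]
basis = build_dictionary(corpus, k=3, top_n=2)
codes = [encode(s, basis, k=3) for s in corpus]
```

Because each vector entry corresponds to a named k-mer, a nonzero entry can be read directly as "this recurring pattern appears in the sequence", which is the kind of interpretability the abstract refers to. A real sparse-recovery formulation would choose the cover by optimizing a reconstruction objective rather than by greedy matching.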