🤖 AI Summary
Chaos Game Representation (CGR) lacks a rigorous mathematical link to k-mer statistics, hindering its interpretability and utility in sequence modeling and generation.
Method: We establish an exact mathematical correspondence between CGR and k-mer frequency spectra, proving that the Frequency CGR (FCGR) of order $k$ is equivalent to a discretization of the CGR at resolution $2^k \times 2^k$ and that its vectorization gives the k-mer frequencies of the sequence. Using Eulerian paths on De Bruijn multigraphs, we devise a sequence reconstruction algorithm that unifies CGR’s geometric representation, k-mer statistical modeling, and sequence synthesis. Our framework combines k-mer spectral vectorization, CGR symmetry analysis, and graph-theoretic constraints to generate DNA sequences, and their corresponding CGR images, that match target k-mer distributions to arbitrarily high precision.
Contribution/Results: Experiments demonstrate high-fidelity k-mer spectral matching on both real genomes and synthetic distributions. The method produces high-quality, interpretable CGR images suitable for deep learning training, establishing a novel, reconstructible paradigm for genomic visualization, sequence modeling, and data augmentation.
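The stated equivalence between FCGR and k-mer counts can be checked numerically on a toy sequence. The following is a minimal sketch, not the paper's implementation: the corner assignment in `CORNERS` is one common convention and is an assumption here, and the function names `cgr_points`, `fcgr_by_binning`, `kmer_cell`, and `fcgr_by_kmers` are hypothetical. Exact rational arithmetic is used so that binning never suffers from floating-point rounding.

```python
from collections import Counter
from fractions import Fraction

# Assumed corner convention (the paper may use a different assignment).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def cgr_points(seq):
    """Yield CGR points: start at the centre, move halfway to each base's corner."""
    x, y = Fraction(1, 2), Fraction(1, 2)
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        yield x, y

def fcgr_by_binning(seq, k):
    """Count CGR points per cell of a 2^k x 2^k grid (points with >= k bases read)."""
    n = 2 ** k
    grid = Counter()
    for i, (x, y) in enumerate(cgr_points(seq)):
        if i >= k - 1:
            grid[(int(x * n), int(y * n))] += 1
    return grid

def kmer_cell(kmer):
    """Grid cell of a k-mer: the last base contributes the most significant bit."""
    col = row = 0
    for base in reversed(kmer):
        cx, cy = CORNERS[base]
        col, row = 2 * col + cx, 2 * row + cy
    return col, row

def fcgr_by_kmers(seq, k):
    """Build the same grid directly from k-mer counts, without any geometry."""
    grid = Counter()
    for i in range(len(seq) - k + 1):
        grid[kmer_cell(seq[i:i + k])] += 1
    return grid

seq = "ACGTTGCAACGT"
same = all(fcgr_by_binning(seq, k) == fcgr_by_kmers(seq, k) for k in (1, 2, 3))
```

Under these assumptions, `same` evaluates to `True`: each grid cell holds exactly the count of one k-mer, so flattening the grid in a fixed cell order yields the k-mer frequency vector.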
📝 Abstract
This paper establishes formal mathematical foundations linking Chaos Game Representations (CGR) of DNA sequences to their underlying $k$-mer frequencies. We prove that the Frequency CGR (FCGR) of order $k$ is mathematically equivalent to a discretization of CGR at resolution $2^k \times 2^k$, and its vectorization corresponds to the $k$-mer frequencies of the sequence. Additionally, we characterize how symmetry transformations of CGR images correspond to specific nucleotide permutations in the originating sequences. Leveraging these insights, we introduce an algorithm that generates synthetic DNA sequences from prescribed $k$-mer distributions by constructing Eulerian paths on De Bruijn multigraphs. This enables reconstruction of sequences matching target $k$-mer profiles with arbitrarily high precision, facilitating the creation of synthetic CGR images for applications such as data augmentation for machine learning-based taxonomic classification of DNA sequences. Numerical experiments validate the effectiveness of our method across both real genomic data and artificially sampled distributions. To our knowledge, this is the first comprehensive framework that unifies CGR geometry, $k$-mer statistics, and sequence reconstruction, offering new tools for genomic analysis and visualization.
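The reconstruction step can be illustrated with a textbook Eulerian-path search on a De Bruijn multigraph. This is a sketch under stated assumptions rather than the authors' algorithm: it uses Hierholzer's algorithm, assumes $k \ge 2$ and integer $k$-mer counts that already admit an Eulerian path (for example, counts taken from a real sequence), and does not cover how a target distribution is turned into such counts. The names `kmer_counts` and `eulerian_sequence` are hypothetical.

```python
from collections import Counter, defaultdict

def kmer_counts(seq, k):
    """Count every overlapping k-mer of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def eulerian_sequence(counts):
    """Spell a sequence that uses each k-mer exactly counts[kmer] times.

    Nodes are (k-1)-mers; each k-mer is an edge from its prefix to its
    suffix, repeated according to its count (Hierholzer's algorithm).
    """
    out_edges = defaultdict(list)
    balance = Counter()  # out-degree minus in-degree per node
    for kmer, m in counts.items():
        u, v = kmer[:-1], kmer[1:]
        out_edges[u].extend([v] * m)
        balance[u] += m
        balance[v] -= m
    # Start at the node with one surplus outgoing edge; if the graph is
    # balanced (Eulerian circuit), any node with outgoing edges works.
    start = next((u for u, b in balance.items() if b == 1),
                 next(iter(out_edges)))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit node
    path.reverse()
    if len(path) != sum(counts.values()) + 1:
        raise ValueError("counts do not admit a single Eulerian path")
    # First node in full, then the last character of each later node.
    return path[0] + "".join(node[-1] for node in path[1:])

target = kmer_counts("ACGTTGCAACGTAC", 3)
rebuilt = eulerian_sequence(target)
matches = kmer_counts(rebuilt, 3) == target
```

The rebuilt sequence need not equal the original string, since several Eulerian paths may exist, but its $k$-mer counts match the target exactly, so its FCGR of order $k$ is identical as well.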