🤖 AI Summary
This work addresses the minimizer construction problem in *k*-mer sampling: given alphabet size σ, *k*-mer length *k*, and window length *w*, find a total order over *k*-mers that minimizes the expected density (i.e., selection frequency) of chosen *k*-mers in random strings. Although the search space has size (σ^*k*)! and the natural integer linear programming formulation is intractable in the worst case, the authors solve the problem exactly in *w* · 2^(σ^*k* + O(*k*)) time and add several tricks that reduce the practical runtime and search space. They also study minimizers through regular languages, using eigenvalue/eigenvector analysis over finite automata to find minimum-density minimizers in the asymptotic case *w* → ∞, and they present a new theoretical lower bound on density. Implementing these algorithms yields minimum-density minimizers for all *w* ≥ 2 and parameter sets (σ, *k*) ∈ {(2,2), (2,3), (2,4), (2,5), (4,2)}. The resulting densities are compared against the average density and the theoretical lower bounds, including the new one.
📝 Abstract
Minimizers are sampling schemes with numerous applications in computational biology. Assuming a fixed alphabet of size $\sigma$, a minimizer is defined by two integers $k,w\ge 2$ and a linear order $\rho$ on strings of length $k$ (also called $k$-mers). A string is processed by a sliding window algorithm that chooses, in each window of length $w+k-1$, its minimal $k$-mer with respect to $\rho$. A key characteristic of the minimizer is its density, which is the expected frequency of chosen $k$-mers among all $k$-mers in a random infinite $\sigma$-ary string. Minimizers of smaller density are preferred as they produce smaller samples with the same guarantee: each window is represented by a $k$-mer. The problem of finding a minimizer of minimum density for given input parameters $(\sigma,k,w)$ has a huge search space of $(\sigma^k)!$ and is representable by an ILP of size $\tilde\Theta(\sigma^{k+w})$, which has worst-case solution time that is doubly-exponential in $(k+w)$ under standard complexity assumptions. We solve this problem in $w\cdot 2^{\sigma^k+O(k)}$ time and provide several additional tricks reducing the practical runtime and search space. As a by-product, we describe an algorithm computing the average density of a minimizer within the same time bound. Then we propose a novel method of studying minimizers via regular languages and show how to find, via the eigenvalue/eigenvector analysis over finite automata, minimizers with the minimal density in the asymptotic case $w\to\infty$. Implementing our algorithms, we compute the minimum density minimizers for $(\sigma,k)\in\{(2,2),(2,3),(2,4),(2,5),(4,2)\}$ and \textbf{all} $w\ge 2$. The obtained densities are compared against the average density and the theoretical lower bounds, including the new bound presented in this paper.
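To make the definitions above concrete, the following is a minimal Python sketch of the sliding-window selection rule and of a density computation for very small parameters. It is not the algorithm of the paper. The function names (`minimizer_positions`, `exact_density`, `best_order_bruteforce`) are hypothetical, ties are assumed to be broken towards the leftmost $k$-mer, and the density is computed with the standard counting argument that a new position is selected exactly when two consecutive windows (a context of $w+k$ characters) pick different $k$-mers. The brute-force search simply enumerates all $(\sigma^k)!$ orders, so it is only feasible for toy inputs such as $(\sigma,k)=(2,2)$.

```python
import itertools


def minimizer_positions(s, k, w, rank):
    """Start positions of the k-mers chosen in s by the sliding-window scheme.

    `rank` maps a k-mer to its position in the linear order (smaller = preferred).
    Ties are broken towards the leftmost k-mer (an assumption of this sketch).
    """
    chosen = set()
    n_kmers = len(s) - k + 1
    for start in range(n_kmers - w + 1):
        # A window holds w consecutive k-mers, i.e. w + k - 1 characters.
        best = min(range(start, start + w), key=lambda i: (rank(s[i:i + k]), i))
        chosen.add(best)
    return sorted(chosen)


def exact_density(alphabet, k, w, rank):
    """Density over uniformly random strings, by enumerating all (w+k)-contexts.

    A context is counted when its second window selects a different position
    than its first window; the fraction of such contexts equals the density.
    """
    charged = 0
    total = 0
    for ctx in itertools.product(alphabet, repeat=w + k):
        s = "".join(ctx)
        first = min(range(0, w), key=lambda i: (rank(s[i:i + k]), i))
        second = min(range(1, w + 1), key=lambda i: (rank(s[i:i + k]), i))
        charged += first != second
        total += 1
    return charged / total


def best_order_bruteforce(alphabet, k, w):
    """Naive search over all (sigma^k)! orders; toy sizes only."""
    kmers = ["".join(p) for p in itertools.product(alphabet, repeat=k)]
    best = None
    for perm in itertools.permutations(kmers):
        rank = {km: r for r, km in enumerate(perm)}
        d = exact_density(alphabet, k, w, rank.__getitem__)
        if best is None or d < best[0]:
            best = (d, perm)
    return best


if __name__ == "__main__":
    lex = lambda kmer: kmer  # lexicographic order on binary 2-mers
    print(minimizer_positions("0110100", k=2, w=2, rank=lex))
    print(exact_density("01", k=2, w=2, rank=lex))
    print(best_order_bruteforce("01", k=2, w=2))
```

Since every window contains at least one chosen $k$-mer, any value returned by `exact_density` lies between $1/w$ and $1$, which gives a quick sanity check on the output.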