On Minimizers of Minimum Density

📅 2025-06-05

🤖 AI Summary

This work addresses the minimizer construction problem in *k*-mer sampling: given alphabet size σ, *k*-mer length *k*, and window length *w*, find a linear order on *k*-mers that minimizes the density, i.e., the expected fraction of *k*-mers selected in a random string. The search space has size (σ^*k*)!, and the natural ILP formulation is too large to solve in practice, so we instead give an exact algorithm running in *w*·2^(σ^*k* + O(*k*)) time, together with techniques that reduce the practical runtime and search space. We also study minimizers via regular languages, using eigenvalue/eigenvector analysis over finite automata to find minimum-density minimizers in the asymptotic case *w* → ∞. With these algorithms we compute minimum-density minimizers for all *w* ≥ 2 and (σ, *k*) ∈ {(2,2), (2,3), (2,4), (2,5), (4,2)}, and compare the resulting densities with the average density and with theoretical lower bounds, including a new bound introduced in the paper.

📝 Abstract

Minimizers are sampling schemes with numerous applications in computational biology. Assuming a fixed alphabet of size $\sigma$, a minimizer is defined by two integers $k, w \ge 2$ and a linear order $\rho$ on strings of length $k$ (also called $k$-mers). A string is processed by a sliding window algorithm that chooses, in each window of length $w+k-1$, its minimal $k$-mer with respect to $\rho$. A key characteristic of the minimizer is its density, which is the expected frequency of chosen $k$-mers among all $k$-mers in a random infinite $\sigma$-ary string. Minimizers of smaller density are preferred as they produce smaller samples with the same guarantee: each window is represented by a $k$-mer. The problem of finding a minimizer of minimum density for given input parameters $(\sigma, k, w)$ has a huge search space of size $(\sigma^k)!$ and is representable by an ILP of size $\tilde\Theta(\sigma^{k+w})$, which has worst-case solution time that is doubly exponential in $(k+w)$ under standard complexity assumptions. We solve this problem in $w \cdot 2^{\sigma^k + O(k)}$ time and provide several additional tricks reducing the practical runtime and search space. As a by-product, we describe an algorithm computing the average density of a minimizer within the same time bound. Then we propose a novel method of studying minimizers via regular languages and show how to find, via the eigenvalue/eigenvector analysis over finite automata, minimizers with the minimal density in the asymptotic case $w \to \infty$. Implementing our algorithms, we compute the minimum density minimizers for $(\sigma, k) \in \{(2,2), (2,3), (2,4), (2,5), (4,2)\}$ and *all* $w \ge 2$. The obtained densities are compared against the average density and the theoretical lower bounds, including the new bound presented in this paper.
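To make the sliding-window definition concrete, here is a minimal Python sketch of a minimizer and an empirical estimate of its density on a finite random string. It is an illustration only, not the paper's implementation: the function names, the lexicographic order used as the example $\rho$, and the leftmost tie-breaking rule are all assumptions introduced here.

```python
import random
from itertools import product


def kmer_rank_lex(k, sigma):
    """Example order (an assumption): plain lexicographic rank of each k-mer."""
    return {kmer: r for r, kmer in enumerate(product(range(sigma), repeat=k))}


def select_positions(s, k, w, rank):
    """Return the set of k-mer start positions chosen by the minimizer.

    Each window covers w consecutive k-mers (w + k - 1 symbols). Ties are
    broken toward the leftmost position, which is an assumption of this sketch.
    """
    kmers = [tuple(s[i:i + k]) for i in range(len(s) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        chosen.add(min(window, key=lambda i: (rank[kmers[i]], i)))
    return chosen


def empirical_density(n, sigma, k, w, rank, seed=0):
    """Fraction of k-mers selected in one random string of length n."""
    rng = random.Random(seed)
    s = [rng.randrange(sigma) for _ in range(n)]
    return len(select_positions(s, k, w, rank)) / (n - k + 1)
```

For example, `select_positions([1, 0, 1, 1, 0, 0, 1], 2, 2, kmer_rank_lex(2, 2))` picks start positions 1, 3, and 4. The empirical estimate only approximates the density defined above, which is an expectation over infinite random strings.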

Problem

Research questions and friction points this paper is trying to address.

Finding minimizers with minimum density for given parameters (σ, k, w).

Reducing the search space and runtime for minimizer computation.

Analyzing minimizers via regular languages for asymptotic cases (w → ∞).
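For very small parameters the search problem can be illustrated by brute force. The sketch below is not the paper's algorithm: it simply tries all $(\sigma^k)!$ orders and evaluates each one exactly by counting "charged" contexts, i.e., strings of length $w+k$ in which the second window selects a different position than the first. The function names and the leftmost tie-breaking rule are assumptions of this sketch, and the approach is only feasible for tiny cases such as $(\sigma, k) = (2, 2)$.

```python
from fractions import Fraction
from itertools import permutations, product


def exact_density(rank, sigma, k, w):
    """Exact density as the fraction of charged contexts.

    A context is a string of length w + k, i.e. w + 1 consecutive k-mers.
    It is charged when its two overlapping windows select different
    positions (leftmost tie-breaking, an assumption of this sketch).
    """
    charged = 0
    for ctx in product(range(sigma), repeat=w + k):
        kmers = [ctx[i:i + k] for i in range(w + 1)]
        first = min(range(0, w), key=lambda i: (rank[kmers[i]], i))
        second = min(range(1, w + 1), key=lambda i: (rank[kmers[i]], i))
        charged += first != second
    return Fraction(charged, sigma ** (w + k))


def best_order(sigma, k, w):
    """Exhaustively search all (sigma^k)! orders; tiny parameters only."""
    kmers = list(product(range(sigma), repeat=k))
    best = None
    for perm in permutations(kmers):
        rank = {kmer: r for r, kmer in enumerate(perm)}
        d = exact_density(rank, sigma, k, w)
        if best is None or d < best[0]:
            best = (d, perm)
    return best
```

Calling `best_order(2, 2, 3)` enumerates only 24 orders over 32 contexts each. The factorial growth of this enumeration is exactly why the search space and runtime reductions listed above matter.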

Innovation

Methods, ideas, or system contributions that make the work stand out.

Exact minimum-density minimizer search in w·2^(σ^k + O(k)) time, avoiding the much larger ILP formulation

Regular language analysis for asymptotic minimizers

Practical runtime optimization techniques
