🤖 AI Summary
This paper studies the de Bruijn graph string reconstruction problem under domain-knowledge constraints: given a position-interval mapping function (c) specifying permissible locations for each length-(k) substring, the goal is to find an Eulerian trail satisfying all edge-position constraints. We introduce the parameter (w = lceil log w + 1
ceil / (k-1)) to quantify the quality of domain knowledge—specifically, the relative width of position intervals—and establish the first exponential-time improvement for this problem class: when interval widths are significantly smaller than (k), reconstruction becomes tractable. Leveraging parameterized algorithm design, interval-constraint modeling, and combinatorial graph-theoretic analysis, we reduce the runtime from (O(m cdot w^{1.5} 4^w)) to a substantially improved bound. Our approach yields the first solution for constrained string reconstruction that simultaneously provides rigorous theoretical guarantees and practical efficiency, with direct applications in genome assembly and privacy-preserving data reconstruction.
📝 Abstract
The reduction of the fragment assembly problem to (variations of) the classical Eulerian trail problem [Pevzner et al., PNAS 2001] has led to remarkable progress in genome assembly. This reduction employs the notion of de Bruijn graph $G=(V,E)$ of order $k$ over an alphabet $Σ$. A single Eulerian trail in $G$ represents a candidate genome reconstruction. Bernardini et al. have also introduced the complementary idea in data privacy [ALENEX 2020] based on $z$-anonymity.
The pressing question is: How hard is it to reconstruct a best string from a de Bruijn graph given a function that models domain knowledge? Such a function maps every length-$k$ string to an interval of positions where it may occur in the reconstructed string. By the above reduction to de Bruijn graphs, the latter function translates into a function $c$ mapping every edge to an interval where it may occur in an Eulerian trail. This gives rise to the following basic problem on graphs: Given an instance $(G,c)$, can we efficiently compute an Eulerian trail respecting $c$? Hannenhalli et al.~[CABIOS 1996] formalized this problem and showed that it is NP-complete.
We focus on parametrization aiming to capture the quality of our domain knowledge in the complexity. Ben-Dor et al. developed an algorithm to solve the problem on de Bruijn graphs in $O(m cdot w^{1.5} 4^{w})$ time, where $m=|E|$ and $w$ is the maximum interval length over all edges. Bumpus and Meeks [Algorithmica 2023] rediscovered the same algorithm on temporal graphs, highlighting the relevance of this problem in other contexts. We give combinatorial insights that lead to exponential-time improvements over the state-of-the-art. For the important class of de Bruijn graphs, we develop an algorithm parametrized by $w (log w+1) /(k-1)$. Our improved algorithm shows that it is enough when the range of positions is small relative to $k$.