When is String Reconstruction using de Bruijn Graphs Hard?

📅 2025-08-05

🤖 AI Summary

This paper studies string reconstruction from de Bruijn graphs under domain-knowledge constraints: given a function $c$ that maps each length-$k$ substring (equivalently, each edge) to an interval of permissible positions, the goal is to find an Eulerian trail that respects every edge-position constraint. The problem is NP-complete in general. With $w$ denoting the maximum interval length over all edges and $m$ the number of edges, the best previous algorithm runs in $O(m \cdot w^{1.5} 4^{w})$ time. The paper gives combinatorial insights that lead to exponential-time improvements over this bound and, for de Bruijn graphs, an algorithm parametrized by $w(\log w + 1)/(k-1)$, a quantity that is small whenever the range of positions is small relative to $k$. The motivating applications are genome assembly and data privacy.

📝 Abstract

The reduction of the fragment assembly problem to (variations of) the classical Eulerian trail problem [Pevzner et al., PNAS 2001] has led to remarkable progress in genome assembly. This reduction employs the notion of de Bruijn graph $G=(V,E)$ of order $k$ over an alphabet $\Sigma$. A single Eulerian trail in $G$ represents a candidate genome reconstruction. Bernardini et al. have also introduced the complementary idea in data privacy [ALENEX 2020] based on $z$-anonymity. The pressing question is: How hard is it to reconstruct a best string from a de Bruijn graph given a function that models domain knowledge? Such a function maps every length-$k$ string to an interval of positions where it may occur in the reconstructed string. By the above reduction to de Bruijn graphs, the latter function translates into a function $c$ mapping every edge to an interval where it may occur in an Eulerian trail. This gives rise to the following basic problem on graphs: Given an instance $(G,c)$, can we efficiently compute an Eulerian trail respecting $c$? Hannenhalli et al. [CABIOS 1996] formalized this problem and showed that it is NP-complete. We focus on parametrization aiming to capture the quality of our domain knowledge in the complexity. Ben-Dor et al. developed an algorithm to solve the problem on de Bruijn graphs in $O(m \cdot w^{1.5} 4^{w})$ time, where $m=|E|$ and $w$ is the maximum interval length over all edges. Bumpus and Meeks [Algorithmica 2023] rediscovered the same algorithm on temporal graphs, highlighting the relevance of this problem in other contexts. We give combinatorial insights that lead to exponential-time improvements over the state-of-the-art. For the important class of de Bruijn graphs, we develop an algorithm parametrized by $w (\log w+1) /(k-1)$. Our improved algorithm shows that it is enough when the range of positions is small relative to $k$.
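To make the problem statement concrete, here is a minimal Python sketch of the decision problem described in the abstract: find an Eulerian trail in a de Bruijn graph where every edge (a length-$k$ string) appears at a position inside its interval. It is a plain exponential backtracking search, not the algorithm of Ben-Dor et al. nor the paper's improved algorithm. The function names `kmers_of` and `reconstruct` are hypothetical, and so is the convention that positions are 0-based indices into the trail, with all copies of the same k-mer sharing one interval.

```python
from collections import Counter


def kmers_of(text, k):
    """List every length-k substring of `text`, with multiplicity.

    Each k-mer is an edge of the order-k de Bruijn graph, going from its
    (k-1)-prefix to its (k-1)-suffix.
    """
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def reconstruct(kmers, intervals):
    """Search for an Eulerian trail that respects the position intervals.

    `intervals[kmer] = (lo, hi)` means the k-mer may only be the i-th edge
    of the trail for lo <= i <= hi (0-based, inclusive; an assumption of
    this sketch). Returns the spelled string, or None if no trail exists.
    """
    remaining = Counter(kmers)          # unused copies of each edge
    total = sum(remaining.values())     # an Eulerian trail uses all edges

    def extend(trail):
        pos = len(trail)
        if pos == total:
            return trail
        for kmer in sorted(remaining):
            if remaining[kmer] == 0:
                continue
            # Consecutive edges must overlap in k-1 characters.
            if trail and trail[-1][1:] != kmer[:-1]:
                continue
            lo, hi = intervals[kmer]
            if not lo <= pos <= hi:
                continue
            remaining[kmer] -= 1
            found = extend(trail + [kmer])
            remaining[kmer] += 1        # undo the choice before moving on
            if found is not None:
                return found
        return None

    trail = extend([])
    if trail is None:
        return None
    # Spell the string: first k-mer, then the last character of each later one.
    return trail[0] + "".join(kmer[-1] for kmer in trail[1:])


if __name__ == "__main__":
    kmers = kmers_of("AACAAGAA", 3)
    wide = {kmer: (0, len(kmers) - 1) for kmer in kmers}
    print(reconstruct(kmers, wide))
    print(reconstruct(kmers, {**wide, "AAG": (0, 1)}))
```

In this toy instance the graph admits several Eulerian trails when every interval is wide; restricting `AAG` to the first two positions rules out the trail spelling `AACAAGAA` and leaves one spelling `AAGAACAA`, which illustrates how the function $c$ prunes candidate reconstructions.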

Problem

Research questions and friction points this paper is trying to address.

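Determine how hard it is to reconstruct a string from a de Bruijn graph when domain knowledge restricts where each length-$k$ substring may occur.

Cope with the NP-completeness of finding an Eulerian trail under interval constraints by parametrizing on the quality of the domain knowledge.

Improve on existing algorithms for de Bruijn graphs when the position ranges are small.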

Innovation

Methods, ideas, or system contributions that make the work stand out.

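Uses de Bruijn graphs for string reconstruction.

Algorithm for de Bruijn graphs parametrized by $w(\log w+1)/(k-1)$, which is small when position ranges are small relative to $k$.

Exponential-time improvements over the $O(m \cdot w^{1.5} 4^{w})$ state of the art.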
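As a rough numeric illustration of "small relative to $k$", the sketch below evaluates the parameter expression $w(\log w+1)/(k-1)$ next to the prior bound $m \cdot w^{1.5} 4^{w}$. The abstract does not state the base of the logarithm or the exact running time as a function of the new parameter, so base 2 is an assumption here, the function names are hypothetical, and the numbers should not be read as a runtime comparison.

```python
import math


def interval_parameter(w, k):
    """Evaluate w * (log(w) + 1) / (k - 1); log base 2 is an assumption."""
    return w * (math.log2(w) + 1) / (k - 1)


def prior_bound(m, w):
    """Evaluate m * w**1.5 * 4**w, the bound quoted for the earlier algorithm
    (constant factors hidden by the O-notation are ignored)."""
    return m * w ** 1.5 * 4 ** w


if __name__ == "__main__":
    # Hypothetical values: maximum interval length w = 16, order k = 33.
    print(interval_parameter(16, 33))
    print(prior_bound(1, 16))
```

With these hypothetical values the parameter stays a small constant while $4^{w}$ is already in the billions, which is the regime the abstract points to.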
