🤖 AI Summary
Practical collinear anchor chaining algorithms for long-read alignment (e.g., ChainX) lack optimality guarantees and can yield suboptimal chains whose cost exceeds the edit distance.
Method: We study the cases in which ChainX fails and propose an optimal collinear chaining algorithm built on the novel “anchor diagonal distance” metric. By integrating computational geometry modeling, interval-tree indexing, and incremental dynamic programming, the algorithm achieves an average-case time complexity of O(n·OPT + n log n), where n is the number of anchors and OPT denotes the optimal chain cost.
Contribution/Results: The algorithm is guaranteed to output an optimal chain, whose cost matches the edit distance of the input sequences when the anchors are maximal exact matches. Empirical evaluation on a realistic long-read dataset shows that ChainX can be suboptimal; our method attains the optimum with minimal slowdown, keeping runtime comparable to ChainX.
📝 Abstract
Colinear chaining is a classical heuristic for sequence alignment and is widely used in modern practical aligners. Jain et al. (J. Comput. Biol. 2022) proposed an $O(n \log^3 n)$ time algorithm to chain a set of $n$ anchors so that the chaining cost matches the edit distance of the input sequences, when anchors are maximal exact matches. Moreover, assuming a uniform and sparse distribution of anchors, they provided a practical solution ($\mathtt{ChainX}$) working in $O(n \cdot \mathsf{SOL} + n \log n)$ average-case time, where $\mathsf{SOL}$ is the cost of the output chain and $n$ is the number of anchors in the input. This practical solution is not guaranteed to be optimal: we study the failing cases, introduce the anchor diagonal distance, and find and implement an optimal algorithm working in the same $O(n \cdot \mathsf{OPT} + n \log n)$ average-case time, where $\mathsf{OPT}$ is the optimal chaining cost; then, we validate the results by Jain et al., show that $\mathtt{ChainX}$ can be suboptimal with a realistic long read dataset, and show minimal computational slowdown for our solution.
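To make the notion of a chaining cost concrete, the following is a minimal quadratic-time sketch of colinear chaining under a simplified model. It is not ChainX and not the optimal algorithm described in the abstract. It assumes anchors in a chain may not overlap and that connecting two consecutive anchors costs the larger of the two gaps between them, whereas the cost function of Jain et al. also accounts for overlapping anchors. The names `Anchor` and `chain_cost` and the example sequences are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# (start in Q, start in T, length): an exact match Q[q:q+l] == T[t:t+l]
Anchor = Tuple[int, int, int]

def chain_cost(anchors: List[Anchor], len_q: int, len_t: int) -> int:
    """Minimum cost of a chain of non-overlapping anchors (simplified model)."""
    # Zero-length sentinel anchors mark the start and the end of both sequences.
    pts = [(0, 0, 0)] + sorted(anchors) + [(len_q, len_t, 0)]
    inf = float("inf")
    cost = [inf] * len(pts)
    cost[0] = 0
    for j in range(1, len(pts)):
        qj, tj, _ = pts[j]
        for i in range(j):
            qi, ti, li = pts[i]
            gap_q = qj - (qi + li)  # unmatched characters of Q between the anchors
            gap_t = tj - (ti + li)  # unmatched characters of T between the anchors
            if gap_q < 0 or gap_t < 0:
                continue  # anchor i does not end before anchor j starts
            # Assumed gap cost: substitute min(gap_q, gap_t) characters and
            # insert or delete the remainder, i.e. max(gap_q, gap_t) edits.
            cost[j] = min(cost[j], cost[i] + max(gap_q, gap_t))
    return cost[-1]

# Q = "ACGTACGT", T = "ACGTTACGT": T has one extra character in the middle.
anchors = [(0, 0, 4), (4, 5, 4)]
print(chain_cost(anchors, 8, 9))  # 1
print(chain_cost([], 8, 9))       # 9
```

In this toy input the chain through both anchors costs 1, which equals the edit distance of the two strings, while the empty chain costs 9. The double loop examines every pair of anchors, which is the kind of work that bounded-distance searches such as the $O(n \cdot \mathsf{OPT} + n \log n)$ approach are meant to avoid.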