Space-Efficient k-Mismatch Text Indexes

📅 2025-10-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the space overhead of $k$-mismatch text indexing. We propose an index structure that retains the $O(\log^k n \log\log n + m + \mathrm{occ})$ query time of $k$-errata trees while reducing space from the classical $O(n \log^k n)$ to $O(n \log^{k-1} n)$, the first improvement to this time-space trade-off in the general case in two decades. For constant-size alphabets, where $O(n \log^{k-1} n)$ space was already known, we compress the index further to $O(n \log^{k-1.5+\varepsilon} n)$ for constant $k \ge 2$ and any constant $\varepsilon > 0$. Methodologically, our approach builds upon the $k$-errata tree framework and integrates a divide-and-conquer strategy with custom-designed data structures, including improved indexes for short patterns. The resulting bounds lower the space cost of approximate pattern matching under Hamming distance at most $k$ over large texts.

📝 Abstract

A central task in string processing is text indexing, where the goal is to preprocess a text (a string of length $n$) into an efficient index (a data structure) supporting queries about the text. Cole, Gottlieb, and Lewenstein (STOC 2004) proposed $k$-errata trees, a family of text indexes supporting approximate pattern matching queries of several types. In particular, $k$-errata trees yield an elegant solution to $k$-mismatch queries, where we are to report all substrings of the text with Hamming distance at most $k$ to the query pattern. The resulting $k$-mismatch index uses $O(n \log^k n)$ space and answers a query for a length-$m$ pattern in $O(\log^k n \log\log n + m + \mathrm{occ})$ time, where $\mathrm{occ}$ is the number of approximate occurrences. In retrospect, $k$-errata trees appear very well optimized: even though a large body of work has adapted $k$-errata trees to various settings throughout the past two decades, the original time-space trade-off for $k$-mismatch indexing has not been improved in the general case. We present the first such improvement, a $k$-mismatch index with $O(n \log^{k-1} n)$ space and the same query time as $k$-errata trees. Previously, due to a result of Chan, Lam, Sung, Tam, and Wong (Algorithmica 2010), such an $O(n \log^{k-1} n)$-size index has been known only for texts over alphabets of constant size. In this setting, however, we obtain an even smaller $k$-mismatch index of size only $O(n \log^{k-2+\varepsilon+\frac{2}{k+2-(k \bmod 2)}} n) \subseteq O(n \log^{k-1.5+\varepsilon} n)$ for $2 \le k \le O(1)$ and any constant $\varepsilon > 0$. Along the way, we also develop improved indexes for short patterns, offering better trade-offs in this practically relevant special case.
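To make the query semantics concrete, the following is a minimal sketch of what a $k$-mismatch query must return. It is a naive index-free scan taking $O(nm)$ time per query, not the $k$-errata tree or the index proposed in the paper; the function names `hamming_distance` and `k_mismatch_occurrences` are illustrative assumptions and do not come from the source.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is defined for equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def k_mismatch_occurrences(text: str, pattern: str, k: int) -> list[int]:
    """Return every start position i such that text[i:i+m] is within
    Hamming distance k of pattern, where m = len(pattern).

    Naive baseline (hypothetical helper, not the paper's data structure):
    it compares the pattern against each length-m window of the text.
    """
    n, m = len(text), len(pattern)
    return [
        i
        for i in range(n - m + 1)
        if hamming_distance(text[i:i + m], pattern) <= k
    ]


if __name__ == "__main__":
    # "abc" matches exactly at 0 and with one mismatch ("abd") at 3.
    print(k_mismatch_occurrences("abcabd", "abc", 1))  # [0, 3]
    print(k_mismatch_occurrences("abcabd", "abc", 0))  # [0]
```

An index answers the same question without rescanning the text, which is where the $O(\log^k n \log\log n + m + \mathrm{occ})$ query bound and the space bounds discussed above come into play.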

Problem

Research questions and friction points this paper is trying to address.

Develops space-efficient indexes for approximate pattern matching

Improves k-mismatch indexing with reduced space complexity

Handles text queries with limited character mismatches efficiently

Innovation

Methods, ideas, or system contributions that make the work stand out.

Space-efficient $k$-mismatch index using $O(n \log^{k-1} n)$ space

Maintains same query time as original k-errata trees

Further size reduction for constant-size alphabets
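As a quick check of the arithmetic behind the constant-alphabet bound, the sketch below evaluates the exponent $k-2+\frac{2}{k+2-(k \bmod 2)}$ from the abstract (with the $\varepsilon$ term left out) for small $k$ and compares it with the general-case exponent $k-1$ and the simplified exponent $k-1.5$. The helper names are assumptions introduced for illustration only.

```python
from fractions import Fraction


def general_exponent(k: int) -> Fraction:
    """Exponent of log n in the general-alphabet space bound O(n log^{k-1} n)."""
    return Fraction(k - 1)


def constant_alphabet_exponent(k: int) -> Fraction:
    """Exponent k - 2 + 2 / (k + 2 - (k mod 2)), omitting the +epsilon term.

    The abstract states this bound for constant k >= 2.
    """
    if k < 2:
        raise ValueError("the bound is stated for k >= 2")
    return Fraction(k - 2) + Fraction(2, k + 2 - (k % 2))


if __name__ == "__main__":
    for k in range(2, 7):
        exact = constant_alphabet_exponent(k)
        simplified = Fraction(2 * k - 3, 2)  # k - 1.5
        # The exact exponent never exceeds k - 1.5, matching the stated inclusion.
        assert exact <= simplified < general_exponent(k)
        print(k, general_exponent(k), exact, simplified)
```

For $k=2$ and $k=3$ the exact exponent equals $k-1.5$ (namely $1/2$ and $3/2$); for larger $k$ it is strictly smaller, for example $7/3$ versus $5/2$ at $k=4$.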

🔎 Similar Papers

A Parametrizable Algorithm for Distributed Approximate Similarity Search with Arbitrary Distances