Longest Common Extensions with Wildcards: Trade-off and Applications

📅 2024-08-07
🏛️ Embedded Systems and Applications
📈 Citations: 1
Influential: 0
🤖 AI Summary
This paper studies the Longest Common Extension (LCE) problem for strings with wildcards, where each wildcard matches any character. Parameterizing by $G$, the number of maximal contiguous groups of wildcards, we present a data structure with a smooth time-space trade-off: for any $t \in [1, G]$, it answers queries in $O(t)$ time using $O(nG/t)$ space and $O(n (G/t) \log n)$ preprocessing time. Up to the $O(\log n)$ factor, this interpolates between the constant-query-time data structure of Crochemore et al. and the kangaroo jumping technique. Through a connection to Boolean matrix multiplication, we show that the trade-off is optimal, up to subpolynomial factors, among combinatorial data structures when $G = \Omega(n^\epsilon)$, under a widely believed hypothesis. Separate conditional lower bounds for non-combinatorial data structures rely on the 3SUM and Set-Disjointness conjectures. We also give a simple deterministic combinatorial algorithm for sparse Boolean matrix multiplication, and we apply the data structure to approximate pattern matching and structural analysis of strings with wildcards.

📝 Abstract
We study the Longest Common Extension (LCE) problem in a string containing wildcards. Wildcards (also called "don't cares" or "holes") are special characters that match any other character in the alphabet, similar to the character "?" in Unix commands or "." in regular expression engines. We consider the problem parametrized by $G$, the number of maximal contiguous groups of wildcards in the input string. Our main contribution is a simple data structure for this problem that can be built in $O(n (G/t) \log n)$ time, occupies $O(nG/t)$ space, and answers queries in $O(t)$ time, for any $t \in [1, G]$. Up to the $O(\log n)$ factor, this interpolates smoothly between the data structure of Crochemore et al. [JDA 2015], which has $O(nG)$ preprocessing time and space, and $O(1)$ query time, and a simple solution based on the "kangaroo jumping" technique [Landau and Vishkin, STOC 1986], which has $O(n)$ preprocessing time and space, and $O(G)$ query time. By establishing a connection between this problem and Boolean matrix multiplication, we show that our solution is optimal, up to subpolynomial factors, among combinatorial data structures when $G = \Omega(n^\epsilon)$ under a widely believed hypothesis. In addition, we develop a simple deterministic combinatorial algorithm for sparse Boolean matrix multiplication. We further establish a conditional lower bound for non-combinatorial data structures, stating that $O(nG/t^4)$ preprocessing time (resp. space) is optimal, up to subpolynomial factors, for any data structure with query time $t$ for a wide range of $t$ and $G$, assuming the well-established $\textsf{3SUM}$ (resp. $\textsf{Set-Disjointness}$) conjecture. Finally, we show that our data structure can be used to obtain efficient algorithms for approximate pattern matching and structural analysis of strings with wildcards.
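
To make the definitions in the abstract concrete, the following is a minimal Python sketch of the wildcard LCE query, the parameter $G$, and the kangaroo-jumping idea of skipping whole wildcard groups. It is an illustration under stated assumptions, not the paper's data structure: all names (`WILDCARD`, `count_wildcard_groups`, `lce_naive`, `wildcard_run_ends`, `lce_kangaroo`) are hypothetical, and the symbol-by-symbol comparison stands in for a constant-time exact LCE structure, so the sketch does not reach the $O(G)$ query bound.

```python
WILDCARD = "?"  # assumed wildcard symbol, as in the Unix "?" analogy


def count_wildcard_groups(text):
    """Return G, the number of maximal contiguous runs of wildcards."""
    groups = 0
    prev_wild = False
    for ch in text:
        wild = ch == WILDCARD
        if wild and not prev_wild:
            groups += 1
        prev_wild = wild
    return groups


def chars_match(a, b):
    """Two symbols match if they are equal or at least one is a wildcard."""
    return a == b or a == WILDCARD or b == WILDCARD


def lce_naive(text, i, j):
    """Largest l such that text[i+k] matches text[j+k] for every k < l."""
    n = len(text)
    length = 0
    while (i + length < n and j + length < n
           and chars_match(text[i + length], text[j + length])):
        length += 1
    return length


def wildcard_run_ends(text):
    """ends[p] is the index just past the wildcard run containing p,
    or p itself when text[p] is not a wildcard."""
    n = len(text)
    ends = list(range(n))
    for p in range(n - 1, -1, -1):
        if text[p] == WILDCARD:
            if p + 1 < n and text[p + 1] == WILDCARD:
                ends[p] = ends[p + 1]
            else:
                ends[p] = p + 1
    return ends


def lce_kangaroo(text, i, j, ends):
    """Same answer as lce_naive, but jumps over wildcard runs in one step."""
    limit = len(text) - max(i, j)
    length = 0
    while length < limit:
        a, b = text[i + length], text[j + length]
        if a == b:
            # Stand-in for one jump of an exact (wildcard-free) LCE structure.
            length += 1
            continue
        if a != WILDCARD and b != WILDCARD:
            break  # genuine mismatch between two ordinary symbols
        # At least one side sits in a wildcard run: skip to the end of the
        # longer remaining run, since wildcards match anything opposite them.
        skip = max(ends[i + length] - (i + length),
                   ends[j + length] - (j + length))
        length = min(length + skip, limit)
    return length
```

For instance, `lce_naive("a?b", 0, 1)` returns 2, because `a` matches `?` and `?` matches `b`. Every wildcard skip in `lce_kangaroo` passes the end of a wildcard group in one of the two suffixes, which is why a query needs at most $O(G)$ skips once the exact-match steps are answered in constant time.
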
Problem

Research questions and friction points this paper is trying to address.

Study the LCE problem in strings with wildcards
Develop a conditionally optimal time-space trade-off data structure for wildcard LCE
Apply the solution to approximate pattern matching and structural analysis of strings
Innovation

Methods, ideas, or system contributions that make the work stand out.

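Data structure for LCE with wildcards with a smooth trade-off between query time and space
Simple deterministic combinatorial algorithm for sparse Boolean matrix multiplication
Efficient approximate pattern matching for strings with wildcards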
Gabriel Bathie
DIENS, École normale supérieure de Paris, PSL Research University, France
P. Charalampopoulos
Birkbeck, University of London, UK
Tatiana Starikovskaya
Ecole Normale Supérieure
Stringology, randomized algorithms, approximate algorithms, streaming algorithms, communication