🤖 AI Summary
This paper studies counting and reporting repetitive structures in strings: k-mismatch repeats (k-repetitions and k-runs, i.e., sequences of consecutive equal-length squares whose halves differ in at most k positions) and parameterized squares. It proves that a string of length n contains *O(nk log k)* k-repetitions and k-runs, which implies that the algorithms of Kolpakov and Kucherov run in *O(nk log k)* time. Building on this bound, it gives an algorithm that reports all non-equivalent parameterized squares in *O(nσ log σ)* time, where σ is the alphabet size. It also shows that the number of non-equivalent squares can be computed in *O(n log n)* time under any substring-compatible equivalence relation (e.g., parameterized or order-preserving matching), and that the same holds for squares that are distinct as strings. Key contributions include: (i) an *O(nk log k)* upper bound on the number of k-repetitions and k-runs; (ii) an *O(nσ log σ)*-time algorithm for reporting non-equivalent parameterized squares; and (iii) an *O(n log n)*-time counting algorithm that applies to any substring-compatible equivalence relation.
📝 Abstract
A $k$-mismatch square is a string of the form $XY$ where $X$ and $Y$ are two equal-length strings that have at most $k$ mismatches. Kolpakov and Kucherov [Theor. Comput. Sci., 2003] defined two notions of $k$-mismatch repeats, called $k$-repetitions and $k$-runs, each representing a sequence of consecutive $k$-mismatch squares of equal length. They proposed algorithms for computing $k$-repetitions and $k$-runs that work in $O(nk \log k + \mathit{output})$ time for a string of length $n$ over an integer alphabet, where $\mathit{output}$ is the number of reported repeats. We show that $\mathit{output} = O(nk \log k)$ for both $k$-repetitions and $k$-runs, which implies that the complexity of their algorithms is actually $O(nk \log k)$. We apply this result to computing parameterized squares.
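To make the definition of a $k$-mismatch square concrete, here is a minimal brute-force sketch in Python. It is only an illustration of the definition, not the algorithm of Kolpakov and Kucherov and not the method of this paper; the function names `hamming`, `is_k_mismatch_square`, and `k_mismatch_squares` are hypothetical, and the enumeration takes cubic time in the worst case.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(x, y))


def is_k_mismatch_square(s: str, k: int) -> bool:
    """True if s = XY with |X| = |Y| > 0 and at most k mismatches between X and Y."""
    half, odd = divmod(len(s), 2)
    if odd or half == 0:
        return False
    return hamming(s[:half], s[half:]) <= k


def k_mismatch_squares(s: str, k: int) -> list[tuple[int, int]]:
    """All (start, half_length) pairs whose substring is a k-mismatch square.

    Naive enumeration over every half-length and start position
    (illustrative only; far slower than the bounds discussed in the text).
    """
    n = len(s)
    return [
        (i, h)
        for h in range(1, n // 2 + 1)
        for i in range(n - 2 * h + 1)
        if is_k_mismatch_square(s[i:i + 2 * h], k)
    ]
```

For example, `abcabd` is a 1-mismatch square (the halves `abc` and `abd` differ in one position) but not a 0-mismatch square. A $k$-repetition or $k$-run groups consecutive start positions that share the same half-length in this listing.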
A parameterized square is a string of the form $XY$ such that $X$ and $Y$ parameterized-match, i.e., there exists a bijection $f$ on the alphabet such that $f(X) = Y$. Two parameterized squares $XY$ and $X'Y'$ are equivalent if they parameterized-match. Recently, Hamai et al. [SPIRE 2024] showed that a string of length $n$ over an alphabet of size $\sigma$ contains fewer than $n\sigma$ non-equivalent parameterized squares, improving an earlier bound by Kociumaka et al. [Theor. Comput. Sci., 2016]. We apply our bound for $k$-mismatch repeats to propose an algorithm that reports all non-equivalent parameterized squares in $O(n\sigma \log \sigma)$ time. We also show that the number of non-equivalent parameterized squares can be computed in $O(n \log n)$ time. This last algorithm applies to squares under any substring-compatible equivalence relation and also to counting squares that are distinct as strings. In particular, if $\sigma = \omega(\log n)$, this improves upon the $O(n\sigma)$-time algorithm of Gawrychowski et al. [CPM 2023] for counting order-preserving squares that are distinct as strings.
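The definitions of parameterized matching and of equivalent parameterized squares can be illustrated with a small Python sketch. It assumes the standard "previous occurrence" encoding, under which two strings parameterized-match exactly when their encodings coincide, and it counts equivalence classes by brute force. This is not the $O(n\sigma \log \sigma)$ or $O(n \log n)$ algorithm from the paper; the names `prev_encode`, `is_parameterized_square`, and `count_nonequivalent_parameterized_squares` are hypothetical.

```python
def prev_encode(s: str) -> tuple[int, ...]:
    """Replace each symbol by the distance to its previous occurrence (0 if none).

    Two strings have the same encoding exactly when some bijection on the
    alphabet maps one to the other.
    """
    last: dict[str, int] = {}
    code = []
    for i, c in enumerate(s):
        code.append(i - last[c] if c in last else 0)
        last[c] = i
    return tuple(code)


def is_parameterized_square(s: str) -> bool:
    """True if s = XY with |X| = |Y| > 0 and X, Y parameterized-match."""
    half, odd = divmod(len(s), 2)
    if odd or half == 0:
        return False
    return prev_encode(s[:half]) == prev_encode(s[half:])


def count_nonequivalent_parameterized_squares(s: str) -> int:
    """Count classes of parameterized squares occurring as substrings of s.

    Squares are equivalent when they parameterized-match as whole strings,
    so each class is identified by the encoding of the full square.
    Naive enumeration, for illustration only.
    """
    n = len(s)
    classes = set()
    for h in range(1, n // 2 + 1):
        for i in range(n - 2 * h + 1):
            sub = s[i:i + 2 * h]
            if is_parameterized_square(sub):
                classes.add(prev_encode(sub))
    return len(classes)
```

For instance, `abba` is a parameterized square because swapping `a` and `b` maps `ab` to `ba`, while `abaa` is not. In `aab`, the substrings `aa` and `ab` are both parameterized squares but are not equivalent to each other, so the count is 2.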