Counting Distinct Square Substrings in Sublinear Time

📅 2025-08-05
🤖 AI Summary
This paper addresses the problem of counting the distinct squares in a packed string, that is, a string whose characters are stored several to a machine word. Previously known algorithms take $O(n)$ time. The paper gives the first sublinear-time algorithm for this setting: working in the word-RAM model, it combines the extraction of squares from runs, an implicit representation of long-period runs, pyramidally-shaped groups of layer runs, and sparse-Lyndon roots built on string synchronizers. The algorithm counts all distinct squares in a packed string of length $n$ in $O(n/\log_\sigma n)$ time, where $\sigma$ is the alphabet size.

📝 Abstract
We show that the number of distinct squares in a packed string of length $n$ over an alphabet of size $\sigma$ can be computed in $O(n/\log_\sigma n)$ time in the word-RAM model. This paper is the first to introduce a sublinear-time algorithm for counting squares in the packed setting. The packed representation of a string of length $n$ over an alphabet of size $\sigma$ is given as a sequence of $O(n/\log_\sigma n)$ machine words in the word-RAM model (a machine word consists of $\omega \ge \log_2 n$ bits). Previously, it was known how to count distinct squares in $O(n)$ time [Gusfield and Stoye, JCSS 2004], even for a string over an integer alphabet [Crochemore et al., TCS 2014; Bannai et al., CPM 2017; Charalampopoulos et al., SPIRE 2020]. We use the techniques for extracting squares from runs described by Crochemore et al. [TCS 2014]. However, the packed model requires novel approaches. We need an $O(n/\log_\sigma n)$-sized representation of all long-period runs (runs with period $\Omega(\log_\sigma n)$) which allows for sublinear-time counting of the (potentially linearly many) implied squares. The long-period runs with a string period that is periodic itself (called layer runs) are an obstacle, since their number can be $\Omega(n)$. The number of all other long-period runs is $O(n/\log_\sigma n)$, and we can construct an implicit representation of all long-period runs in $O(n/\log_\sigma n)$ time by leveraging the insights of Amir et al. [ESA 2019]. We count squares in layer runs by exploiting combinatorial properties of pyramidally-shaped groups of layer runs. Another difficulty lies in computing the locations of Lyndon roots of runs in packed strings, which is needed for grouping runs that may generate equal squares. To overcome this difficulty, we introduce sparse-Lyndon roots, which are based on string synchronizers [Kempa and Kociumaka, STOC 2019].
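To make the packed representation concrete, here is a minimal Python sketch that stores several characters per machine word. It is an illustration only, not code from the paper: the names `pack` and `char_at` are hypothetical, Python integers stand in for machine words, and the 64-bit word size is an assumption. With $\lceil \log_2 \sigma \rceil$ bits per character, a word of $\Theta(\log n)$ bits holds $\Theta(\log_\sigma n)$ characters, which is where the $O(n/\log_\sigma n)$ word count comes from.

```python
def pack(text, alphabet, word_bits=64):
    """Pack `text` into integers that each stand in for one machine word.

    `word_bits=64` is an assumed word size; the model only needs >= log2(n) bits.
    """
    code = {c: i for i, c in enumerate(alphabet)}
    # Bits needed per character: ceil(log2(sigma)), at least 1.
    bits = max(1, (len(alphabet) - 1).bit_length())
    per_word = word_bits // bits  # characters that fit into one word
    words = []
    for start in range(0, len(text), per_word):
        w = 0
        for offset, c in enumerate(text[start:start + per_word]):
            w |= code[c] << (offset * bits)
        words.append(w)
    return words, bits, per_word


def char_at(words, bits, per_word, i, alphabet):
    """Read character i back out of the packed words."""
    w = words[i // per_word]
    mask = (1 << bits) - 1
    return alphabet[(w >> ((i % per_word) * bits)) & mask]


text = "acgt" * 40  # n = 160, sigma = 4, so 2 bits per character
words, bits, per_word = pack(text, "acgt")
unpacked = "".join(char_at(words, bits, per_word, i, "acgt") for i in range(len(text)))
```

In this example 32 characters fit into each 64-bit word, so the 160-character string occupies 5 words. A sublinear-time algorithm cannot afford to unpack every character as the last line does; that loop is only there to check the round trip.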
Problem

Research questions and friction points this paper is trying to address.

Counting distinct square substrings in sublinear time
Handling packed string representation efficiently
Overcoming obstacles with long-period and layer runs
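As a reference point for what is being counted, the following brute-force Python sketch enumerates the distinct squares of a string, where a square is a substring of the form $uu$ with $u$ non-empty. It is only a definition check with cubic worst-case running time, not the algorithm summarized on this page, and the function name `distinct_squares` is hypothetical.

```python
def distinct_squares(s):
    """Return the set of distinct substrings of s that have the form u + u."""
    found = set()
    n = len(s)
    for i in range(n):
        # `half` is the length of u; the square s[i : i + 2*half] must fit in s.
        for half in range(1, (n - i) // 2 + 1):
            if s[i:i + half] == s[i + half:i + 2 * half]:
                found.add(s[i:i + 2 * half])
    return found


squares = distinct_squares("abaabaab")
```

For `"abaabaab"` the squares found are `aa`, `abaaba`, `baabaa`, and `aabaab`. The square `aa` occurs twice in the string but is counted once, which is the sense of "distinct" here.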
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sublinear-time algorithm for packed strings
Implicit representation of long-period runs
Sparse-Lyndon roots using string synchronizers
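The role of Lyndon roots in grouping can be illustrated with the classical notion: the Lyndon root of a square $uu$ is the lexicographically smallest rotation of the primitive root of $u$. The Python sketch below computes it naively in quadratic time. It shows the ordinary Lyndon root only, not the sparse-Lyndon roots introduced in the paper, and the names `primitive_root` and `lyndon_root` are hypothetical.

```python
def primitive_root(w):
    """Shortest prefix r of the non-empty string w such that w is a power of r."""
    n = len(w)
    for p in range(1, n + 1):
        if n % p == 0 and w[:p] * (n // p) == w:
            return w[:p]


def lyndon_root(square):
    """Smallest rotation of the primitive root of u, for a square u + u."""
    u = square[:len(square) // 2]
    root = primitive_root(u)
    return min(root[k:] + root[:k] for k in range(len(root)))


roots = {sq: lyndon_root(sq) for sq in ["abaaba", "baabaa", "aabaab"]}
```

All three squares map to the same Lyndon root `aab`, since their halves `aba`, `baa`, and `aab` are rotations of one another. Grouping by this canonical key brings together squares, and the runs that contain them, whose roots are rotations of each other.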