Counting distinct (non-)crossing substrings

📅 2025-06-27

🤖 AI Summary

This paper addresses the problem of counting distinct substrings of a string $w$ of length $n$ relative to a position: for each position $k$, compute $C(w,k)$, the number of distinct substrings that have an occurrence containing $k$, and $N(w,k)$, the number of distinct substrings that have an occurrence not containing $k$. The two counts are not complementary, since a substring with several occurrences can be counted in both. The paper presents linear-time algorithms for both quantities: all values $C(w,k)$ are computed in $O(n)$ total time over general ordered alphabets, and all values $N(w,k)$ are computed in $O(n)$ total time under the assumption that the alphabet is linearly sortable (e.g., integer alphabets). Our approach integrates suffix arrays, LCP arrays, and sweep-line techniques, with data structure operations chosen according to alphabet properties. Applying the textbook's $O(n)$-per-position solution at every position costs $O(n^2)$ in total, so the new algorithms save a factor of $n$ when the counts are needed for all positions.

📝 Abstract

Let $w$ be a string of length $n$. The problem of counting factors crossing a position, Problem 64 from the textbook "125 Problems in Text Algorithms" [Crochemore, Lecroq, and Rytter, 2021], asks to count the number $\mathcal{C}(w,k)$ (resp. $\mathcal{N}(w,k)$) of distinct substrings in $w$ that have occurrences containing (resp. not containing) a position $k$ in $w$. The solutions provided in the textbook compute $\mathcal{C}(w,k)$ and $\mathcal{N}(w,k)$ in $O(n)$ time for a single position $k$ in $w$, and thus a direct application would require $O(n^2)$ time for all positions $k = 1, \ldots, n$ in $w$. The textbook solutions are designed for constant-size alphabets. In this paper, we present new algorithms which compute $\mathcal{C}(w,k)$ in $O(n)$ total time for general ordered alphabets, and $\mathcal{N}(w,k)$ in $O(n)$ total time for linearly sortable alphabets, for all positions $k = 1, \ldots, n$ in $w$.
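To make the two definitions concrete, the following is a minimal brute-force sketch in Python. It is not the paper's algorithm and not the textbook's solution: it simply enumerates every occurrence of every substring, so it is only suitable as a reference for checking small inputs. The function name `crossing_counts` is hypothetical, positions are 0-indexed here (the abstract uses $k = 1, \ldots, n$), and only non-empty substrings are counted, which is an assumption about the convention.

```python
def crossing_counts(w: str) -> tuple[list[int], list[int]]:
    """Brute-force reference for C(w, k) and N(w, k) at every position k.

    Assumptions (not taken from the paper): positions are 0-indexed and
    the empty substring is not counted.
    """
    n = len(w)
    # crossing[k]: distinct substrings with an occurrence containing k
    # avoiding[k]: distinct substrings with an occurrence not containing k
    crossing = [set() for _ in range(n)]
    avoiding = [set() for _ in range(n)]

    # Enumerate every occurrence w[i:j] with 0 <= i < j <= n.
    for i in range(n):
        for j in range(i + 1, n + 1):
            s = w[i:j]
            for k in range(n):
                if i <= k < j:
                    crossing[k].add(s)   # this occurrence covers position k
                else:
                    avoiding[k].add(s)   # this occurrence misses position k

    c_counts = [len(c) for c in crossing]
    n_counts = [len(a) for a in avoiding]
    return c_counts, n_counts


if __name__ == "__main__":
    c_counts, n_counts = crossing_counts("aba")
    print(c_counts, n_counts)
```

For `w = "aba"` the sketch yields `[3, 4, 3]` for the crossing counts and `[3, 1, 3]` for the avoiding counts. At the middle position the substring `a` is counted only as avoiding, while at the first position `a` is counted in both sets because it has one occurrence on each side, which shows that the two counts do not sum to the total number of distinct substrings.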

Problem

Research questions and friction points this paper is trying to address.

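Count the distinct substrings of a string that have an occurrence containing (or not containing) a position $k$.

Reduce the total time for all positions $k = 1, \ldots, n$ from $O(n^2)$ to $O(n)$.

Extend the solution beyond constant-size alphabets to general ordered and linearly sortable alphabets.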

Innovation

Methods, ideas, or system contributions that make the work stand out.

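$O(n)$ total time for $\mathcal{C}(w,k)$ at all positions over general ordered alphabets

All positions handled together rather than in $O(n)$ time per position

$O(n)$ total time for $\mathcal{N}(w,k)$ at all positions over linearly sortable alphabets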
