A faster algorithm for efficient longest common substring calculation for non-parametric entropy estimation in sequential data

📅 2025-10-15

🤖 AI Summary

In non-parametric entropy estimation for sequential data, computing the longest common substring (LCS) is inefficient, taking up to *O*(*n*³) time in the worst case, and existing methods do not support dynamically growing sequences. To address this, the paper proposes LCSFinder, an LCS algorithm built on sorted suffix arrays and persistent binary search trees. With a carefully designed matching mechanism, the method reduces the worst-case time complexity of LCS queries on dynamic sequences to *O*(*n* log *n*), enabling real-time, scalable non-parametric entropy estimation over continuously expanding sequences. Experiments on both real-world and synthetic datasets show a 10×–100× speedup over state-of-the-art methods, making online non-parametric entropy analysis feasible for large-scale signal streams and extending the applicability of non-parametric entropy estimation in real-time information processing.

📝 Abstract

Non-parametric entropy estimation on sequential data is a fundamental tool in signal processing, capturing information flow within or between processes to measure predictability, redundancy, or similarity. Methods based on longest common substrings (LCS) provide a non-parametric estimate of typical set size but are often inefficient, which limits their use on real-world data. We introduce LCSFinder, a new algorithm that improves the worst-case performance of LCS calculations from cubic to log-linear time. Although it is built on standard algorithmic constructs, including sorted suffix arrays and persistent binary search trees, the details require care to provide the matches required for entropy estimation on dynamically growing sequences. We demonstrate that LCSFinder achieves dramatic speedups over existing implementations on real and simulated data, enabling entropy estimation at scales previously infeasible in practical signal processing.

Problem

Research questions and friction points this paper is trying to address.

Improving efficiency of longest common substring calculations

Enabling entropy estimation on large-scale sequential data

Reducing computational complexity from cubic to log-linear time

Innovation

Methods, ideas, or system contributions that make the work stand out.

Algorithm improves LCS calculation to log-linear time

Uses suffix arrays and persistent binary search trees

Enables entropy estimation on dynamically growing sequences
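The suffix-array ingredient can be illustrated with a small static sketch. This is not LCSFinder: it handles a fixed text only, builds the suffix array by plain sorting (much slower than dedicated constructions), and omits the persistent binary search trees the paper uses for growing sequences. All function names are hypothetical.

```python
import bisect


def build_sorted_suffixes(text: str) -> tuple[list[int], list[str]]:
    """Return (suffix array, sorted suffixes) for a fixed text.

    Built by direct sorting for clarity; the materialized suffixes
    use quadratic memory, so this is for illustration only.
    """
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return sa, [text[i:] for i in sa]


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_match(suffixes: list[str], query: str) -> int:
    """Longest prefix of query that occurs somewhere in the text.

    In sorted order, the suffix sharing the longest prefix with the
    query sits next to the query's insertion point, so one binary
    search plus two comparisons is enough.
    """
    pos = bisect.bisect_left(suffixes, query)
    best = 0
    for j in (pos - 1, pos):
        if 0 <= j < len(suffixes):
            best = max(best, common_prefix_len(suffixes[j], query))
    return best


if __name__ == "__main__":
    sa, suffixes = build_sorted_suffixes("banana")
    print(sa)
    print(longest_match(suffixes, "anan"))
```

Entropy estimation needs matches restricted to the symbols seen before each position, which a single static suffix array does not provide on its own. According to the abstract, that is the part where the details require care.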

🔎 Similar Papers

Improving Numerical Stability of Normalized Mutual Information Estimator on High Dimensions