Contextual Pattern Mining and Counting

📅 2025-06-21
🤖 AI Summary
This paper studies mining and counting the contexts of patterns in strings. Given a text $T$ of length $n$ and parameters $(m,l,r)$, the context of a length-$m$ pattern $P$ is the set of all pairs $(L,R)$ with $|L|=l$ and $|R|=r$ such that $LPR$ occurs in $T$. Two problems are formulated: Contextual Pattern Mining (CPM), which outputs the context of every length-$m$ substring of $T$ whose context size is at least a threshold $\tau$; and Contextual Pattern Counting (CPC), which preprocesses $T$ so that the context size of any query pattern $P$ of length $m$ can be found efficiently. For CPM, the authors propose a linear-work algorithm that runs either entirely in internal memory or with a bounded amount of internal memory plus external memory. For CPC, they propose an $\widetilde{\mathcal{O}}(n)$-space index whose practical performance is improved by exploiting the LZ77 factorization of $T$ and an upper bound on the query length. On billion-letter datasets, the external-memory CPM algorithm handles very large inputs with little internal memory at a runtime comparable to the internal-memory version, and the optimized CPC index outperforms a baseline built on the state of the art in query time, index size, construction time, and construction space, often by more than an order of magnitude.

📝 Abstract
Given a string $P$ of length $m$, a longer string $T$ of length $n>m$, and two integers $l\geq 0$ and $r\geq 0$, the context of $P$ in $T$ is the set of all string pairs $(L,R)$, with $|L|=l$ and $|R|=r$, such that the string $LPR$ occurs in $T$. We introduce two problems related to the notion of context: (1) the Contextual Pattern Mining (CPM) problem, which given $T$, $(m,l,r)$, and an integer $\tau>0$, asks for outputting the context of each substring $P$ of length $m$ of $T$, provided that the size of the context of $P$ is at least $\tau$; and (2) the Contextual Pattern Counting (CPC) problem, which asks for preprocessing $T$ so that the size of the context of a given query string $P$ of length $m$ can be found efficiently. For CPM, we propose a linear-work algorithm that either uses only internal memory, or a bounded amount of internal memory and external memory, which allows much larger datasets to be handled. For CPC, we propose an $\widetilde{\mathcal{O}}(n)$-space index that can be constructed in $\widetilde{\mathcal{O}}(n)$ time and answers queries in $\mathcal{O}(m)+\widetilde{\mathcal{O}}(1)$ time. We further improve the practical performance of the CPC index by optimizations that exploit the LZ77 factorization of $T$ and an upper bound on the query length. Using billion-letter datasets from different domains, we show that the external memory version of our CPM algorithm can deal with very large datasets using a small amount of internal memory while its runtime is comparable to that of the internal memory version. Interestingly, we also show that our optimized index for CPC outperforms an approach based on the state of the art for the reporting version of CPC [Navarro, SPIRE 2020] in terms of query time, index size, construction time, and construction space, often by more than an order of magnitude.
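
To make the definitions of context, CPM, and CPC concrete, here is a minimal brute-force sketch in Python. It is not the paper's linear-work algorithm or its index: it simply scans every position of $T$ and stores contexts in hash sets, so it is only suitable for small inputs. The function names `contexts`, `cpm`, and `cpc_index` are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def contexts(T, m, l, r):
    """Map each length-m substring P of T to its set of (L, R) pairs
    with |L| = l and |R| = r such that L + P + R occurs in T."""
    ctx = defaultdict(set)
    # P starts at i; it needs l letters before it and r letters after it.
    for i in range(l, len(T) - m - r + 1):
        P = T[i:i + m]
        L = T[i - l:i]
        R = T[i + m:i + m + r]
        ctx[P].add((L, R))
    return ctx


def cpm(T, m, l, r, tau):
    """Naive CPM: report the context of every length-m substring
    whose context size is at least tau (tau > 0)."""
    return {P: c for P, c in contexts(T, m, l, r).items() if len(c) >= tau}


def cpc_index(T, m, l, r):
    """Naive CPC: precompute all context sizes, then answer queries
    by dictionary lookup (0 for patterns with no context in T)."""
    sizes = {P: len(c) for P, c in contexts(T, m, l, r).items()}
    return lambda P: sizes.get(P, 0)


if __name__ == "__main__":
    T = "abracadabra"
    mined = cpm(T, m=1, l=1, r=1, tau=2)
    print(sorted(mined["a"]))   # [('c', 'd'), ('d', 'b'), ('r', 'c')]
    query = cpc_index(T, m=1, l=1, r=1)
    print(query("a"), query("b"), query("z"))  # 3 1 0
```

In this example, the letter `b` occurs twice in `abracadabra` but always as `abr`, so its context size is 1: the problems count distinct $(L,R)$ pairs, not occurrences.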
Problem

Research questions and friction points this paper is trying to address.

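Mine the contexts of all length-$m$ substrings whose context size is at least $\tau$ in large datasets efficiently
Count the distinct contexts of a query pattern using a preprocessed index
Handle very large datasets using a bounded amount of internal memory plus external memory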
Innovation

Methods, ideas, or system contributions that make the work stand out.

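Linear-work algorithm for Contextual Pattern Mining, in internal-memory and external-memory versions
$\widetilde{\mathcal{O}}(n)$-space index for Contextual Pattern Counting with $\mathcal{O}(m)+\widetilde{\mathcal{O}}(1)$ query time
Practical index optimizations that exploit the LZ77 factorization of $T$ and an upper bound on the query length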
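
For readers unfamiliar with the LZ77 factorization mentioned above, the following Python sketch shows what such a factorization looks like. The abstract does not say how the index exploits it, so this illustrates only the factorization itself, using a naive quadratic-time greedy scan. It assumes the variant in which each factor is either a single new letter or the longest prefix of the remaining text that also starts at an earlier position (overlaps allowed); the names `lz77_factorize` and `lz77_decode` are hypothetical.

```python
def lz77_factorize(T):
    """Greedy LZ77 factorization (naive, quadratic time).
    Each factor is a single letter, or a pair (src, length) meaning
    'copy `length` letters starting at earlier position `src`'."""
    factors = []
    i, n = 0, len(T)
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):  # try every earlier starting position
            k = 0
            while i + k < n and T[j + k] == T[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        if best_len == 0:
            factors.append(T[i])          # letter not seen before
            i += 1
        else:
            factors.append((best_src, best_len))
            i += best_len
    return factors


def lz77_decode(factors):
    """Rebuild the text; copying letter by letter handles overlaps."""
    out = []
    for f in factors:
        if isinstance(f, str):
            out.append(f)
        else:
            src, length = f
            for k in range(length):
                out.append(out[src + k])
    return "".join(out)


if __name__ == "__main__":
    fs = lz77_factorize("abracadabra")
    print(fs)  # ['a', 'b', 'r', (0, 1), 'c', (0, 1), 'd', (0, 4)]
    print(lz77_decode(fs) == "abracadabra")  # True
```

Repetitive texts produce few factors (`abababab` yields only three), which is the general reason a factorization of this kind can help shrink an index on repetitive data.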