Regular Expression Indexing for Log Analysis. Extended Version

📅 2025-10-11
🤖 AI Summary
To address the low efficiency of regular expression (regex) queries over log data, the poor performance caused by the lack of indexing support in existing regex engines, and excessive storage overhead, this paper proposes REI, a lightweight, regex-aware indexing system. REI introduces an n-gram-based index tailored to regex workloads, which avoids the limitations of conventional inverted indexes in log analytics. It combines index construction with an efficient storage strategy so that queries run substantially faster while the extra space stays small. REI is also modular and works with existing regex packages. Experimental evaluation shows that REI delivers up to a 14× speedup over state-of-the-art regex engines that do not use indexing, with only 2.1% additional storage. These results indicate that REI improves both the speed and the resource efficiency of large-scale log analysis.

📝 Abstract
In this paper, we present the design and architecture of REI, a novel system for indexing log data for regular expression queries. Our main contribution is an $n$-gram-based indexing strategy and an efficient storage mechanism that results in a speedup of up to 14x compared to state-of-the-art regex processing engines that do not use indexing, using only 2.1% of extra space. We perform a detailed study that analyzes the space usage of the index and the improvement in workload execution time, uncovering interesting insights. Specifically, we show that even an optimized implementation of strategies such as inverted indexing, which are widely used in text processing libraries, may lead to suboptimal performance for regex indexing on log analysis tasks. Overall, the REI approach presented in this paper provides a significant boost when evaluating regular expression queries on log data. REI is also modular and can work with existing regular expression packages, making it easy to deploy in a variety of settings. The code of REI is available at https://github.com/mush-zhang/REI-Regular-Expression-Indexing.

Problem

Research questions and friction points this paper is trying to address.

Designing an efficient regex indexing system for log data analysis
Accelerating regex queries with n-gram indexing and storage optimization
Overcoming limitations of traditional text indexing for log processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

n-gram-based indexing strategy for log data
efficient storage with minimal extra space
modular integration with existing regex packages
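
To make the general idea of n-gram prefiltering for regex queries concrete, here is a minimal Python sketch. It is not REI's implementation: the abstract does not describe REI's index construction or storage format, so everything below is an assumption made for illustration, including the trigram length `N = 3`, the one-bitmap-per-n-gram layout, and the helper names `ngrams`, `build_index`, `required_literals`, and `search`. The literal extractor is deliberately conservative and falls back to a full scan for any pattern it does not understand.

```python
import re
from collections import defaultdict

N = 3  # n-gram length; an assumed value, not taken from the paper


def ngrams(text, n=N):
    """Return the set of all length-n substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(lines, n=N):
    """Map each n-gram to an integer bitmap; bit i is set if line i contains it."""
    index = defaultdict(int)
    for line_no, line in enumerate(lines):
        for gram in ngrams(line, n):
            index[gram] |= 1 << line_no
    return index


def required_literals(pattern):
    """Hypothetical helper: literals that every match of pattern must contain.

    Only plain characters plus '.', '*', '+', '^' and '$' are understood.
    Any other metacharacter yields [] so the caller scans every line.
    """
    if any(ch in pattern for ch in "|()[]{}?\\"):
        return []
    literals, run = [], ""
    for ch in pattern:
        if ch == "*":
            # The character before '*' may occur zero times, so drop it.
            literals.append(run[:-1])
            run = ""
        elif ch in ".+^$":
            literals.append(run)
            run = ""
        else:
            run += ch
    literals.append(run)
    return [lit for lit in literals if lit]


def search(pattern, lines, index, n=N):
    """Return the line numbers matching pattern, using the index as a prefilter."""
    candidates = (1 << len(lines)) - 1  # start with every line as a candidate
    for literal in required_literals(pattern):
        for gram in ngrams(literal, n):
            candidates &= index.get(gram, 0)
    regex = re.compile(pattern)  # the unmodified regex engine does the final check
    return [i for i in range(len(lines))
            if (candidates >> i) & 1 and regex.search(lines[i])]


logs = [
    "2025-10-11 12:00:01 ERROR disk full on /dev/sda1",
    "2025-10-11 12:00:02 INFO request served in 12ms",
    "2025-10-11 12:00:03 ERROR timeout contacting db-7",
    "2025-10-11 12:00:04 WARN disk usage at 91%",
]
index = build_index(logs)
print(search("ERROR .*disk", logs, index))
```

In this sketch the index only narrows the set of candidate lines, and Python's standard `re` module still performs the actual matching. That loosely mirrors the modular design described above, where the index sits in front of an existing regex package rather than replacing it.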