🤖 AI Summary
Regular expression (regex) queries over log data are slow when the regex engine has no index to prune the search space, and conventional indexes can add considerable storage overhead. This paper proposes REI, a lightweight indexing system for regex queries over logs. REI combines an n-gram-based indexing strategy with an efficient storage mechanism. The accompanying study shows that even an optimized inverted index, a strategy widely used in text processing libraries, may perform suboptimally for regex indexing on log analysis tasks. REI is modular and works with existing regex packages, which makes it easy to deploy. In the evaluation, REI delivers up to a 14× speedup over state-of-the-art regex engines that do not use indexing, while using only 2.1% extra space.
📝 Abstract
In this paper, we present the design and architecture of REI, a novel system for indexing log data for regular expression queries. Our main contribution is an $n$-gram-based indexing strategy and an efficient storage mechanism that results in a speedup of up to 14x compared to state-of-the-art regex processing engines that do not use indexing, using only 2.1% of extra space. We perform a detailed study that analyzes the space usage of the index and the improvement in workload execution time, uncovering interesting insights. Specifically, we show that even an optimized implementation of strategies such as inverted indexing, which are widely used in text processing libraries, may lead to suboptimal performance for regex indexing on log analysis tasks. Overall, the REI approach presented in this paper provides a significant boost when evaluating regular expression queries on log data. REI is also modular and can work with existing regular expression packages, making it easy to deploy in a variety of settings. The code of REI is available at https://github.com/mush-zhang/REI-Regular-Expression-Indexing.
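To make the general idea of n-gram indexing for regex queries concrete, the following is a minimal Python sketch of a generic n-gram prefilter. It is not REI's actual index layout, n-gram selection strategy, or storage format, none of which are described in the abstract. The class name `NGramPrefilter`, the trigram default, the per-n-gram bitmap representation, and the `required_literals` argument (literal substrings that every match must contain, supplied by the caller rather than derived from the pattern) are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of all length-n substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class NGramPrefilter:
    """Hypothetical n-gram prefilter: one bitmap of line ids per n-gram."""

    def __init__(self, lines, n=3):
        self.lines = list(lines)
        self.n = n
        # n-gram -> bitmap stored as a Python int; bit i is set when
        # line i contains that n-gram.
        self.bitmaps = defaultdict(int)
        for line_id, line in enumerate(self.lines):
            for gram in ngrams(line, n):
                self.bitmaps[gram] |= 1 << line_id

    def candidates(self, required_literals):
        """Ids of lines that contain every n-gram of every required literal."""
        mask = (1 << len(self.lines)) - 1  # start with all lines
        for literal in required_literals:
            # Literals shorter than n yield no n-grams and filter nothing.
            for gram in ngrams(literal, self.n):
                mask &= self.bitmaps.get(gram, 0)
        return [i for i in range(len(self.lines)) if (mask >> i) & 1]

    def search(self, pattern, required_literals):
        """Run the unmodified regex engine on the surviving candidates only."""
        regex = re.compile(pattern)
        return [
            self.lines[i]
            for i in self.candidates(required_literals)
            if regex.search(self.lines[i])
        ]


logs = [
    "2024-01-01 ERROR disk full on /dev/sda1",
    "2024-01-01 INFO user alice logged in",
    "2024-01-02 ERROR timeout after 30s",
    "2024-01-02 WARN disk usage 91%",
]
index = NGramPrefilter(logs)
hits = index.search(r"ERROR .*\d+s$", required_literals=["ERROR"])
```

In this sketch the index only narrows the set of lines, and the standard `re` module still decides every match, which mirrors the modular arrangement the abstract describes. The prefilter can admit false positives but never drops a true match, provided each supplied literal really is required by the pattern.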