🤖 AI Summary
This paper investigates the membership testing problem for semantic regular expressions (SemREs): determining whether a string matches a pattern that incorporates queries to external oracles (e.g., LLMs or databases). We propose a two-pass NFA-based algorithm that runs in O(|r|²|w|² + |r||w|³) time in general, and in O(|r|²|w|²) time when oracle queries are not nested, assuming each query is answered in unit time. We connect SemRE membership testing to the triangle detection problem, which suggests that algorithms that are both practical and asymptotically faster may be hard to find. Because oracle queries are costly, we also prove an Ω(|w|²) lower bound on the number of oracle queries needed. Experiments show that the algorithm outperforms a dynamic programming baseline by over an order of magnitude, while incurring an overhead of approximately 2× over the time spent interacting with the oracle.
📝 Abstract
SMORE (Chen et al., 2023) recently proposed the concept of semantic regular expressions that extend the classical formalism with a primitive to query external oracles such as databases and large language models (LLMs). Such patterns can be used to identify lines of text containing references to semantic concepts such as cities, celebrities, political entities, etc. The focus in their paper was on automatically synthesizing semantic regular expressions from positive and negative examples. In this paper, we study the membership testing problem. First, we present a two-pass NFA-based algorithm to determine whether a string $w$ matches a semantic regular expression (SemRE) $r$ in $O(|r|^2 |w|^2 + |r| |w|^3)$ time, assuming the oracle responds to each query in unit time. In common situations, where oracle queries are not nested, we show that this procedure runs in $O(|r|^2 |w|^2)$ time. Experiments with a prototype implementation of this algorithm validate our theoretical analysis, and show that the procedure massively outperforms a dynamic programming-based baseline, and incurs a $\approx 2\times$ overhead over the time needed for interaction with the oracle. Next, we establish connections between SemRE membership testing and the triangle finding problem from graph theory, which suggest that developing algorithms that are simultaneously practical and asymptotically faster might be challenging. Furthermore, algorithms for classical regular expressions primarily aim to optimize their time and memory consumption. In contrast, an important consideration in our setting is to minimize the cost of invoking the oracle. We demonstrate an $\Omega(|w|^2)$ lower bound on the number of oracle queries necessary to make this determination.
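
To make the membership testing problem concrete, the following is a minimal Python sketch of a naive memoized matcher over substring spans. It is not the two-pass NFA-based algorithm described above, and it is not necessarily the dynamic programming baseline used in the experiments. The AST classes (`AnyChar`, `Char`, `Alt`, `Concat`, `Star`, `Query`), the `matches` function, and the set-backed `CountingOracle` are all hypothetical names introduced here for illustration. The sketch assumes a query node accepts a substring when its inner pattern matches and the oracle accepts that substring.

```python
from dataclasses import dataclass
from functools import lru_cache


# Hypothetical SemRE syntax tree, introduced only for this sketch.
@dataclass(frozen=True)
class AnyChar:            # matches any single character
    pass

@dataclass(frozen=True)
class Char:               # matches exactly the character c
    c: str

@dataclass(frozen=True)
class Alt:                # left | right
    left: object
    right: object

@dataclass(frozen=True)
class Concat:             # left followed by right
    left: object
    right: object

@dataclass(frozen=True)
class Star:               # zero or more repetitions of inner
    inner: object

@dataclass(frozen=True)
class Query:              # inner must match AND the oracle must accept
    inner: object
    name: str


class CountingOracle:
    """Toy oracle backed by sets of strings; counts how often it is called."""
    def __init__(self, facts):
        self.facts = facts
        self.calls = 0

    def __call__(self, name, text):
        self.calls += 1
        return text in self.facts.get(name, set())


def matches(r, w, oracle):
    """Return True iff the whole string w matches the SemRE r."""
    @lru_cache(maxsize=None)
    def m(node, i, j):
        # Does w[i:j] match node? Each (node, i, j) is evaluated at most once.
        if isinstance(node, AnyChar):
            return j == i + 1
        if isinstance(node, Char):
            return j == i + 1 and w[i] == node.c
        if isinstance(node, Alt):
            return m(node.left, i, j) or m(node.right, i, j)
        if isinstance(node, Concat):
            return any(m(node.left, i, k) and m(node.right, k, j)
                       for k in range(i, j + 1))
        if isinstance(node, Star):
            # The first repetition must consume at least one character,
            # which guarantees termination.
            return i == j or any(m(node.inner, i, k) and m(node, k, j)
                                 for k in range(i + 1, j + 1))
        if isinstance(node, Query):
            # Consult the oracle only when the syntactic part already matches.
            return m(node.inner, i, j) and oracle(node.name, w[i:j])
        raise TypeError(f"unknown node: {node!r}")

    return m(r, 0, len(w))


# Example pattern: any text containing a non-empty substring the oracle calls a city.
sigma_star = Star(AnyChar())
city = Query(Concat(AnyChar(), sigma_star), "city")
r = Concat(sigma_star, Concat(city, sigma_star))

hit = CountingOracle({"city": {"Paris", "Rome"}})
print(matches(r, "in Paris", hit))               # True

miss = CountingOracle({"city": {"Paris", "Rome"}})
print(matches(r, "abc", miss), miss.calls)       # False 6
```

On the non-matching input `"abc"`, this sketch queries the oracle on all six non-empty substrings, i.e. $|w|(|w|+1)/2$ queries for $|w| = 3$. This illustrates why a quadratic number of oracle queries arises naturally for patterns of this shape.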