🤖 AI Summary
We address the problem of evaluating conjunctive queries with degree constraints and disjunctive Datalog rules (DDRs). We propose the first algorithmic framework that is both fully general (supporting arbitrary degree bounds, free variables, and Boolean and non-Boolean queries) and asymptotically optimal in time complexity. To eliminate the dominant polylogarithmic overhead inherent in PANDA, we introduce a data-skew-aware dynamic hyperplane partitioning scheme and derive a novel probabilistic inequality. Integrating submodular width theory with explicit degree-constraint modeling, our algorithm achieves the tight complexity bound $\tilde{O}(N^{\text{subw}})$, matching the runtimes of specialized algorithms while natively supporting common integrity constraints such as functional dependencies. Experimental results demonstrate substantial performance gains over PANDA, alongside greater simplicity and practical applicability.
📝 Abstract
PANDA is a powerful generic algorithm for answering conjunctive queries (CQs) and disjunctive Datalog rules (DDRs) given input degree constraints. In the special case where the degree constraints are cardinality constraints and the query is Boolean, PANDA runs in time $\tilde{O}(N^{\text{subw}})$, where $N$ is the input size and $\text{subw}$ is the submodular width of the query, a notion introduced by Daniel Marx (JACM 2013). When specialized to certain classes of sub-graph pattern finding problems, the $\tilde{O}(N^{\text{subw}})$ runtime matches the optimal runtime possible, modulo some conjectures in fine-grained complexity (Bringmann and Gorbachev, STOC 2025). The PANDA framework is much more general: it handles arbitrary input degree constraints, which capture common statistics and integrity constraints used in relational database management systems; it works for queries with free variables; and it supports both CQs and DDRs.
The key weakness of PANDA is the large $\text{polylog}(N)$ factor hidden in the $\tilde{O}(\cdot)$ notation. This factor makes PANDA completely impractical and causes it to fall short of what is achievable with specialized algorithms. This paper resolves this weakness with two novel ideas. First, we prove a new probabilistic inequality that upper-bounds the output size of DDRs under arbitrary degree constraints. Second, the proof of this inequality directly leads to a new algorithm named PANDAExpress that is both simpler and faster than PANDA. The novel feature of PANDAExpress is a new partitioning scheme that uses arbitrary hyperplane cuts instead of the axis-parallel hyperplanes used in PANDA. These hyperplanes are dynamically constructed based on data-skewness statistics carefully tracked throughout the algorithm's execution. As a result, PANDAExpress removes the $\text{polylog}(N)$ factor from the runtime of PANDA, matching the runtimes of intricate specialized algorithms while retaining all of PANDA's generality and power.
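To make the notion of a degree constraint concrete, the following is a minimal Python sketch, not taken from the paper. It assumes the standard reading in which the degree of attributes $Y$ given attributes $X$ in a relation is the largest number of distinct $Y$-values that co-occur with any single $X$-value. The function name `max_degree`, the row-as-dict encoding, and the toy relations are all hypothetical choices for illustration.

```python
from collections import defaultdict

def max_degree(relation, x_cols, y_cols):
    """Largest number of distinct Y-tuples seen with any single X-tuple.

    `relation` is a list of dict rows (an illustrative encoding, not the
    paper's). With x_cols == [] every row falls into one group, so the
    result is the number of distinct Y-tuples, i.e. a cardinality.
    """
    groups = defaultdict(set)
    for row in relation:
        x_key = tuple(row[c] for c in x_cols)
        y_val = tuple(row[c] for c in y_cols)
        groups[x_key].add(y_val)
    return max((len(vals) for vals in groups.values()), default=0)

# Toy relation R(a, b): the value a = 1 is paired with two b-values.
R = [
    {"a": 1, "b": 10},
    {"a": 1, "b": 11},
    {"a": 2, "b": 10},
]

# Toy relation S(a, b) in which each a-value determines b.
S = [
    {"a": 1, "b": 10},
    {"a": 2, "b": 10},
]

cardinality_R = max_degree(R, [], ["a", "b"])   # cardinality constraint: |R|
deg_b_given_a = max_degree(R, ["a"], ["b"])     # general degree constraint
fd_a_to_b_in_S = max_degree(S, ["a"], ["b"]) == 1  # functional dependency a -> b
```

Under this reading, a cardinality constraint is the special case with an empty $X$, and a functional dependency is the special case where the degree bound is 1, which is why degree constraints can express both statistics and integrity constraints.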