PANDAExpress: a Simpler and Faster PANDA Algorithm

📅 2025-12-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

We address the problem of evaluating conjunctive queries with degree constraints and disjunctive Datalog rules (DDRs). We propose the first algorithmic framework that is both fully general (supporting arbitrary degree bounds, free variables, and Boolean/non-Boolean queries) and asymptotically optimal in time complexity. To eliminate the dominant polylogarithmic overhead inherent in PANDA, we introduce a data-skew-aware dynamic hyperplane partitioning scheme and derive a novel probabilistic inequality. Integrating submodular width theory with explicit degree-constraint modeling, our algorithm achieves the tight complexity bound $\tilde{O}(N^{\text{subw}})$, matching the precision of specialized algorithms while natively supporting common integrity constraints such as functional dependencies. Experimental results demonstrate substantial performance gains over PANDA, alongside greater simplicity and practical applicability.

📝 Abstract

PANDA is a powerful generic algorithm for answering conjunctive queries (CQs) and disjunctive datalog rules (DDRs) given input degree constraints. In the special case where the degree constraints are cardinality constraints and the query is Boolean, PANDA runs in $\tilde O(N^{\text{subw}})$ time, where $N$ is the input size and $\text{subw}$ is the submodular width of the query, a notion introduced by Daniel Marx (JACM 2013). When specialized to certain classes of subgraph pattern-finding problems, the $\tilde O(N^{\text{subw}})$ runtime matches the best possible runtime, modulo some conjectures in fine-grained complexity (Bringmann and Gorbachev, STOC 2025). The PANDA framework is much more general: it handles arbitrary input degree constraints, which capture common statistics and integrity constraints used in relational database management systems; it works for queries with free variables; and it supports both CQs and DDRs. The key weakness of PANDA is the large $\operatorname{polylog}(N)$ factor hidden in the $\tilde O(\cdot)$ notation. This factor makes PANDA completely impractical and causes it to fall short of what is achievable with specialized algorithms. This paper resolves this weakness with two novel ideas. First, we prove a new probabilistic inequality that upper-bounds the output size of DDRs under arbitrary degree constraints. Second, the proof of this inequality leads directly to a new algorithm, named PANDAExpress, that is both simpler and faster than PANDA. The novel feature of PANDAExpress is a new partitioning scheme that uses arbitrary hyperplane cuts instead of the axis-parallel hyperplanes used in PANDA. These hyperplanes are constructed dynamically, based on data-skewness statistics carefully tracked throughout the algorithm's execution. As a result, PANDAExpress removes the $\operatorname{polylog}(N)$ factor from the runtime of PANDA, matching the runtimes of intricate specialized algorithms while retaining all of PANDA's generality and power.
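
To make the notion of a degree constraint concrete, here is a minimal Python sketch. It assumes the usual definition: the degree of $Y$ given $X$ in a relation is the largest number of distinct $Y$-values that occur with any single $X$-value. Under that definition, a cardinality constraint is the case where $X$ is empty, and a functional dependency $X \to Y$ is the case where the degree is at most 1. The function name `degree`, the toy relation `R`, and its attribute names are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def degree(relation, attrs, x, y):
    """Return deg(Y | X): the max number of distinct Y-values per X-value.

    relation: iterable of tuples; attrs: attribute names in tuple order;
    x, y: tuples of attribute names (x may be empty).
    """
    xi = [attrs.index(a) for a in x]
    yi = [attrs.index(a) for a in y]
    groups = defaultdict(set)
    for t in relation:
        x_val = tuple(t[i] for i in xi)
        y_val = tuple(t[i] for i in yi)
        groups[x_val].add(y_val)
    return max((len(v) for v in groups.values()), default=0)


# Hypothetical toy relation R(a, b), used only for illustration.
attrs = ("a", "b")
R = [(1, 10), (1, 11), (1, 12), (2, 10), (3, 13)]

# X empty: the degree is simply the cardinality |R|.
cardinality = degree(R, attrs, x=(), y=("a", "b"))

# Degree of b given a: a = 1 is paired with three distinct b-values.
deg_b_given_a = degree(R, attrs, x=("a",), y=("b",))

# Degree of a given b: b = 10 is paired with two distinct a-values.
deg_a_given_b = degree(R, attrs, x=("b",), y=("a",))

# The functional dependency a -> b holds iff deg(b | a) <= 1.
fd_a_to_b_holds = deg_b_given_a <= 1

print(cardinality, deg_b_given_a, deg_a_given_b, fd_a_to_b_holds)
```

A degree constraint then asserts an upper bound on such a value, for example $\deg(b \mid a) \le 3$ for the toy relation above.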

Problem

Research questions and friction points this paper is trying to address.

Removes the large polylog(N) factor hidden in PANDA's runtime

Introduces new probabilistic inequality for output size

Uses dynamic hyperplane cuts for data partitioning

Innovation

Methods, ideas, or system contributions that make the work stand out.

New probabilistic inequality for output size bounds

Dynamic hyperplane partitioning based on data-skewness

Removes polylog factor to match specialized algorithm runtimes
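
For intuition about degree-based partitioning in general, the sketch below splits a relation into parts in which the degree of a key attribute lies in a range $[2^i, 2^{i+1})$. This is a generic bucketing step assumed purely for illustration: it is not the hyperplane scheme of PANDAExpress, whose construction is not spelled out here, and the function name `bucket_by_degree` and the toy data are hypothetical. It does show that thresholding one degree at a time can produce up to about $\log_2 N$ parts per relation.

```python
from collections import Counter, defaultdict


def bucket_by_degree(relation, key_index=0):
    """Group tuples by floor(log2(degree of their key value)).

    Assumes the relation has no duplicate tuples, so the number of tuples
    sharing a key value equals that key's degree.
    """
    deg = Counter(t[key_index] for t in relation)
    buckets = defaultdict(list)
    for t in relation:
        # For d >= 1, d.bit_length() - 1 equals floor(log2(d)).
        i = deg[t[key_index]].bit_length() - 1
        buckets[i].append(t)
    return dict(buckets)


# Hypothetical toy relation: key 1 has degree 4, key 2 degree 2, key 3 degree 1.
R = [(1, 10), (1, 11), (1, 12), (1, 13), (2, 10), (2, 11), (3, 13)]
parts = bucket_by_degree(R)

for i in sorted(parts):
    print(f"degree in [{2**i}, {2**(i + 1)}): {parts[i]}")
```

Within each part, the key degrees differ by less than a factor of two, so a single degree bound describes the whole part up to a constant.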
