🤖 AI Summary
This paper studies the range entropy query problem on geometric data: given an axis-aligned (hyper)rectangular query region, efficiently compute the Shannon or Rényi entropy of the weighted, colored points it contains. We formally define the problem and establish conditional lower bounds on the space–time trade-off. We then give two kinds of index: (1) exact data structures with $o(n^{2d})$ space and $o(n)$ query time in $d$ dimensions; and (2) approximate structures that use near-linear space and estimate the entropy with controllable additive or multiplicative error. The design integrates computational geometry indexing, multidimensional range aggregation, hierarchical grid partitioning, and weighted color frequency statistics. These results address a limitation of classical entropy computation, which does not support range predicates, and enable applications including data partitioning, compression, and cardinality estimation.
📝 Abstract
Data partitioning that maximizes/minimizes the Shannon entropy, or more generally the Rényi entropy, is a crucial subroutine in data compression, columnar storage, and cardinality estimation algorithms. These partition algorithms can be accelerated if we have a data structure that computes the entropy of different subsets of the data when the algorithm needs to decide what block to construct. Such a data structure is also useful for data analysts exploring different subsets of data to identify areas of interest. While it is generally known how to compute the Shannon or the Rényi entropy of a discrete distribution efficiently in the offline or streaming setting, we focus on the query setting, where we aim to efficiently derive the entropy among a subset of data that satisfy some linear predicates. We solve this problem in a setting typical of real data, where data items are geometric points and each requested area is a query (hyper)rectangle. More specifically, we consider a set $P$ of $n$ weighted and colored points in $\mathbb{R}^d$. For the range S-entropy (resp. R-entropy) query problem, the goal is to construct a low-space data structure such that, given a query (hyper)rectangle $R$, it computes the Shannon (resp. Rényi) entropy based on the colors and the weights of the points in $P\cap R$ in sublinear time. We show conditional lower bounds proving that we cannot hope for data structures with near-linear space and near-constant query time for both the range S-entropy and R-entropy query problems. Then, we propose exact data structures for $d=1$ and $d>1$ with $o(n^{2d})$ space and $o(n)$ query time for both problems. Finally, we propose near-linear-space data structures for returning either an additive or a multiplicative approximation of the Shannon (resp. Rényi) entropy in $P\cap R$.
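To make the query concrete, the linear-scan baseline that such data structures aim to beat can be sketched as follows. This is a minimal illustration, not the paper's method. It assumes that the probability of a color is its share of the total weight among the points in $P\cap R$ (the abstract does not spell out the formula), that rectangle boundaries are closed, and that logarithms are base 2. All function names are hypothetical.

```python
import math
from collections import defaultdict


def color_distribution(points, rect):
    """Weight share of each color among the points inside rect.

    points: iterable of (coords, color, weight) tuples.
    rect:   one (lo, hi) pair per dimension; boundaries are treated
            as closed (an assumption of this sketch).
    """
    totals = defaultdict(float)
    for coords, color, weight in points:
        if all(lo <= x <= hi for x, (lo, hi) in zip(coords, rect)):
            totals[color] += weight
    total = sum(totals.values())
    if total == 0:
        return {}  # no weight falls inside the query rectangle
    return {color: w / total for color, w in totals.items()}


def shannon_entropy(dist):
    """Shannon entropy in bits: -sum_i p_i * log2(p_i)."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)


def renyi_entropy(dist, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1), in bits."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    if not dist:
        return 0.0  # convention chosen here for an empty range
    power_sum = sum(p ** alpha for p in dist.values() if p > 0)
    return math.log2(power_sum) / (1 - alpha)


# Toy input: four weighted, colored points in the plane (d = 2).
points = [
    ((1.0, 1.0), "red", 1.0),
    ((2.0, 2.0), "blue", 1.0),
    ((3.0, 3.0), "red", 2.0),
    ((10.0, 10.0), "green", 5.0),
]
query = [(0.0, 5.0), (0.0, 5.0)]  # the rectangle [0, 5] x [0, 5]

dist = color_distribution(points, query)
print(dist, shannon_entropy(dist), renyi_entropy(dist, 2.0))
```

Each query here scans all $n$ points, so it takes $\Theta(n)$ time per query when $d$ is treated as a constant. The exact and approximate structures described in the abstract are meant to answer the same query in sublinear time after preprocessing.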