🤖 AI Summary
This paper addresses the Direct Access to Ranked Retrieval (DAR) problem in interactive data exploration: efficiently retrieving the tuple at a specified rank position over high-dimensional, large-scale datasets without enumerating the tuples ranked before it. We propose Conformal Set Ranked Retrieval (CSR), a novel paradigm that replaces exact tuple identification with *guaranteed sets*: compact subsets provably containing the target-ranked tuple. To support CSR, we design a hierarchical index structure integrating geometric partitioning and ε-sampling, coupled with Stripe Range Retrieval (SRR) for modeling narrow-range rank queries. We theoretically establish that CSR achieves near-optimal query complexity. Experiments on million-tuple, hundred-dimensional datasets demonstrate scalability and sub-second response times for interactive ranked access in high-dimensional big data scenarios.
📝 Abstract
We study the problem of Direct-Access Ranked Retrieval (DAR) for interactive data tooling, where evolving data exploration practices, combined with large-scale and high-dimensional datasets, create new challenges. DAR asks for efficient access to arbitrary rank positions under a ranking function, without enumerating all preceding tuples. To address this need, we formalize the DAR problem and propose a theoretically efficient algorithm based on geometric arrangements, achieving logarithmic query time. However, this method suffers from exponential space complexity in high dimensions. Therefore, we develop a second class of algorithms based on $\varepsilon$-sampling, which use linear space. Since exactly locating the tuple at a specific rank is challenging due to its connection to the range counting problem, we introduce a relaxed variant called Conformal Set Ranked Retrieval (CSR), which returns a small subset guaranteed to contain the target tuple. To solve the CSR problem efficiently, we define an intermediate problem, Stripe Range Retrieval (SRR), and design a hierarchical sampling data structure tailored for narrow-range queries. Our method achieves practical scalability in both data size and dimensionality. We prove near-optimal bounds on the efficiency of our algorithms and validate their performance through extensive experiments on real and synthetic datasets, demonstrating scalability to millions of tuples and hundreds of dimensions.
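To make the problem statements above concrete, the following is a minimal Python sketch of what DAR and CSR ask for. It is not the paper's algorithm. It assumes a linear ranking function (a weighted sum of attributes), distinct scores, and rank 1 meaning the highest score; none of these are stated in the abstract. The names `score`, `dar_bruteforce`, and `csr_from_sample` are hypothetical. The CSR sketch assumes the sample is an $\varepsilon$-sample for score thresholds, meaning the fraction of sampled tuples above any threshold differs from the fraction in the full data by at most `eps`. It then collects the resulting score stripe with a plain linear scan, which is the step a dedicated SRR index would presumably replace.

```python
import math


def score(weights, tup):
    """Assumed linear ranking function: weighted sum of the attributes."""
    return sum(w * x for w, x in zip(weights, tup))


def dar_bruteforce(data, weights, k):
    """Baseline DAR: sort everything, return the tuple at rank k (1-indexed)."""
    ranked = sorted(data, key=lambda t: score(weights, t), reverse=True)
    return ranked[k - 1]


def csr_from_sample(data, sample, weights, k, eps):
    """Return a set containing the rank-k tuple, assuming `sample` is an
    eps-sample of `data` for score thresholds (an assumption of this sketch)."""
    n, m = len(data), len(sample)
    desc = sorted((score(weights, t) for t in sample), reverse=True)
    tol = 1e-9  # guards against floating-point rounding; only widens the set
    # The number of sampled tuples scoring at least as high as the target
    # must lie in [lo, hi] when the eps-sample assumption holds.
    lo = max(math.ceil(m * k / n - m * eps - tol), 0)
    hi = math.floor(m * k / n + m * eps + tol)
    upper = desc[lo - 1] if lo >= 1 else math.inf
    lower = desc[hi] if hi < m else -math.inf
    # Linear scan over the stripe (lower, upper]; an index would avoid this.
    return [t for t in data if lower < score(weights, t) <= upper]


# Toy 1-D data: values 1..100, sampled at every 10th value.
data = [(i,) for i in range(1, 101)]
sample = [(i,) for i in range(10, 101, 10)]
weights = (1.0,)

target = dar_bruteforce(data, weights, 37)
candidates = csr_from_sample(data, sample, weights, 37, eps=0.1)
print(target in candidates, len(candidates))
```

In this toy setup the regular sample deviates from the true fractions by at most 0.09, so `eps=0.1` is a valid choice and the candidate set of 20 tuples contains the rank-37 tuple. With a random sample, the same containment would hold only with some probability that depends on the sample size.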