On Aggregation Queries over Predicted Nearest Neighbors

📅 2025-02-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the efficiency–accuracy trade-off in Aggregation Queries over Nearest Neighbors (AQNNs), where expensive predictors (e.g., deep neural networks or human experts) hinder scalability. We propose SPRinT, a framework that decouples and jointly optimizes three stages: sampling, neighbor refinement, and aggregation. SPRinT employs confidence-aware probabilistic sampling, replaces costly predictors with lightweight surrogate models, and introduces a theoretically grounded aggregation estimator with provable bounds on sample complexity and estimation error. The method supports various aggregation functions and enables tunable accuracy–efficiency trade-offs. Extensive evaluation on medical, e-commerce, and video datasets demonstrates that SPRinT significantly reduces both aggregation error and computational overhead. Scalability experiments confirm stable query latency and bounded error growth as data scale increases, validating its suitability for large-scale, real-time applications.

📝 Abstract
We introduce Aggregation Queries over Nearest Neighbors (AQNNs), a novel type of aggregation queries over the predicted neighborhood of a designated object. AQNNs are prevalent in modern applications where, for instance, a medical professional may want to compute "the average systolic blood pressure of patients whose predicted condition is similar to a given insomnia patient". Since prediction typically involves an expensive deep learning model or a human expert, we formulate query processing as the problem of returning an approximate aggregate by combining an expensive oracle and a cheaper model (e.g., a simple ML model) to compute the predictions. We design the Sampler with Precision-Recall in Target (SPRinT) framework for answering AQNNs. SPRinT consists of sampling, nearest neighbor refinement, and aggregation, and is tailored for various aggregation functions. It enjoys provable theoretical guarantees, including bounds on sample size and on error in approximate aggregates. Our extensive experiments on medical, e-commerce, and video datasets demonstrate that SPRinT consistently achieves the lowest aggregation error with minimal computation cost compared to its baselines. Scalability results show that SPRinT's execution time and aggregation error remain stable as the dataset size increases, confirming its suitability for large-scale applications.
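The three-stage query flow described in the abstract — sample, refine the predicted neighborhood with a cheap model plus a confirming oracle, then aggregate — can be sketched in a few lines. This is a minimal illustration, not the paper's actual algorithm: the function name, the dictionary-based records, and the `cheap_model`/`oracle` signatures are all hypothetical placeholders, and the paper's precision-recall targeting and sample-size bounds are not reproduced here.

```python
import random
import statistics

def sprint_style_aggregate(dataset, query, cheap_model, oracle,
                           sample_size=200, threshold=0.5, seed=0):
    """Approximate an AQNN-style aggregate (AVG here) over objects whose
    predicted condition is similar to `query`, calling the expensive
    oracle only on a small, cheaply pre-screened sample.
    Illustrative sketch only; names and signatures are assumptions."""
    rng = random.Random(seed)
    # Stage 1: sampling -- draw a bounded-size sample of the dataset.
    sample = rng.sample(dataset, min(sample_size, len(dataset)))
    # Stage 2: neighbor refinement -- the cheap model screens candidates,
    # and the expensive oracle is invoked only on the survivors.
    candidates = [x for x in sample if cheap_model(x, query) >= threshold]
    neighbors = [x for x in candidates if oracle(x, query)]
    # Stage 3: aggregation over the refined predicted neighborhood.
    if not neighbors:
        return None
    return statistics.mean(x["value"] for x in neighbors)
```

For instance, with `value` holding systolic blood pressure, `cheap_model` a similarity score from a lightweight classifier, and `oracle` the expensive predictor, the oracle cost scales with the sample size rather than the dataset size.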
Problem

Research questions and friction points this paper is trying to address.

Aggregation Queries over Nearest Neighbors
Combining expensive oracle and cheaper model
SPRinT framework for scalable, accurate aggregation
Innovation

Methods, ideas, or system contributions that make the work stand out.

AQNNs aggregate predicted neighbor data
SPRinT combines expensive and cheap models
SPRinT ensures low error and scalability