🤖 AI Summary
This work addresses the challenge of performing $k$-clustering when only noisy quadruplet comparisons—such as those derived from human feedback or large language models—are available, rendering traditional methods that rely on exact distances inapplicable. The paper introduces randomized algorithms that operate under a noisy quadruplet-comparison oracle, computing $O(k\cdot\text{polylog}(n))$ centers whose clustering cost is within a constant factor of the optimal $k$-clustering cost in arbitrary metric spaces, and achieving a $(1+\varepsilon)$-approximation in bounded doubling-dimension spaces. The query complexity is $O(nk\cdot\text{polylog}(n))$ for general metric spaces and $O((n+k^2)\cdot\text{polylog}(n))$ in the bounded doubling-dimension setting. The framework further shows how low-cost, noisy oracles can be systematically integrated into the clustering pipeline, offering a practical and theoretically grounded approach to similarity-based clustering under limited and imperfect information.
📝 Abstract
Clustering is a fundamental primitive in unsupervised learning. However, classical algorithms for $k$-clustering (such as $k$-median and $k$-means) assume access to exact pairwise distances -- an unrealistic requirement in many modern applications. We study clustering in the \emph{Rank-model (R-model)}, where access to distances is entirely replaced by a \emph{quadruplet oracle} that provides only relative distance comparisons. In practice, such an oracle can represent learned models or human feedback, and is expected to be noisy and to entail an access cost. Given a metric space with $n$ input items, we design randomized algorithms that, using only a noisy quadruplet oracle, compute a set of $O(k \cdot \mathsf{polylog}(n))$ centers along with a mapping from the input items to the centers such that the clustering cost of the mapping is at most a constant factor times the optimum $k$-clustering cost. Our method achieves a query complexity of $O(n\cdot k \cdot \mathsf{polylog}(n))$ for arbitrary metric spaces, improving to $O((n+k^2) \cdot \mathsf{polylog}(n))$ when the underlying metric has bounded doubling dimension. In the bounded doubling-dimension setting we can further improve the approximation factor from a constant to $1+\varepsilon$, for any arbitrarily small constant $\varepsilon\in(0,1)$, while preserving the same asymptotic query complexity. Our framework demonstrates how noisy, low-cost oracles, such as those derived from large language models, can be systematically integrated into scalable clustering algorithms.
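To make the access model concrete, here is a minimal, illustrative sketch of a noisy quadruplet oracle in Python. The oracle answers only relative comparisons of the form "is $d(a,b) \le d(c,d)$?" and never reveals distances themselves. The independent answer-flipping noise model with probability `p_err`, and the helper names `make_quadruplet_oracle` and `dist`, are assumptions for illustration; the paper's precise noise model and interface may differ.

```python
import random


def make_quadruplet_oracle(dist, p_err=0.1, rng=None):
    """Simulate a noisy quadruplet oracle over a hidden metric.

    `dist(a, b)` returns the (hidden) distance between items a and b.
    The oracle answers whether d(a, b) <= d(c, d), flipping the true
    answer independently with probability `p_err` (an illustrative
    noise model, not necessarily the one used in the paper).
    """
    rng = rng or random.Random()

    def oracle(a, b, c, d):
        truth = dist(a, b) <= dist(c, d)
        # Flip the answer with probability p_err to model noise.
        return truth if rng.random() >= p_err else not truth

    return oracle


# Example: four points on the real line with |x - y| as the metric.
points = [0.0, 1.0, 5.0, 6.0]
oracle = make_quadruplet_oracle(
    lambda i, j: abs(points[i] - points[j]), p_err=0.0
)
# With p_err = 0 the oracle reports exact comparisons:
# is d(0, 1) <= d(0, 2)?  True, since 1.0 <= 5.0.
print(oracle(0, 1, 0, 2))
```

An algorithm in the R-model may issue only such queries, so its query complexity counts calls to `oracle`; with noise, repeated queries and majority voting are the standard way to boost confidence in a comparison.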