🤖 AI Summary
This work addresses the problem of estimating correlation clustering cost in the node-arrival streaming model, where the stream contains only the nodes and the similar/dissimilar edges are derived through a similarity function, unlike prior work in which the stream consists of edges. The authors propose C⁴Approx, a streaming algorithm that approximates the cost of correlation clustering using space sublinear in the number of nodes and a constant number of passes. They complement this result with lower bounds. Experiments on real-world datasets show that, while storing only 2% of the nodes, C⁴Approx achieves performance comparable to the classic Pivot algorithm and the more recent PrunedPivot algorithm, even on sparse graphs.
📝 Abstract
We study the correlation clustering problem in the node-arrival data stream model. Unlike previous work, where the stream consists of the graph's edges, we focus on the setting in which the stream contains only the nodes. This model better reflects many real-world scenarios in which the data stream naturally consists of raw objects (e.g., images, tweets), and the similar/dissimilar edges are derived through a similarity function. We present C$^4$Approx, a streaming algorithm that approximates the cost of correlation clustering using sublinear space in the number of nodes and a constant number of passes. We further complement this result with lower bounds. Experiments on real-world datasets show that by storing only 2% of the nodes, our algorithm achieves performance comparable to the classic Pivot algorithm and the more recent PrunedPivot algorithm, even on sparse graphs.
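To make the setting concrete, the following is a minimal Python sketch of the quantities the abstract refers to: nodes arrive as raw objects, edges are derived on demand from a similarity function, the classic Pivot algorithm produces a clustering, and the correlation clustering cost counts the disagreements. This is not C⁴Approx: the sketch keeps every node in memory, whereas the paper's algorithm uses sublinear space. The `similar` function, the integer "objects", and the threshold are hypothetical choices made only for illustration.

```python
import random
from itertools import combinations


def similar(a, b, threshold=1):
    # Hypothetical similarity function (an assumption, not from the paper):
    # two integers are joined by a "similar" edge if they differ by at most
    # `threshold`; otherwise the pair is "dissimilar".
    return abs(a - b) <= threshold


def pivot(nodes, similar, rng):
    # Classic Pivot: take the next node in a random order as the pivot,
    # cluster it with every remaining node similar to it, then repeat
    # on the nodes that are left.
    remaining = list(nodes)
    rng.shuffle(remaining)
    clusters = []
    while remaining:
        p, rest = remaining[0], remaining[1:]
        clusters.append([p] + [v for v in rest if similar(p, v)])
        remaining = [v for v in rest if not similar(p, v)]
    return clusters


def clustering_cost(clusters, similar):
    # Correlation clustering cost = number of disagreements:
    # similar pairs split across clusters plus dissimilar pairs
    # placed in the same cluster. Nodes must be distinct and hashable.
    label = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    cost = 0
    for u, v in combinations(label, 2):
        same_cluster = label[u] == label[v]
        if similar(u, v) != same_cluster:
            cost += 1
    return cost


# Nodes only; no edge list is ever given, edges come from `similar`.
nodes = [1, 2, 3, 10, 11, 20]
clusters = pivot(nodes, similar, random.Random(0))
cost = clustering_cost(clusters, similar)
print(clusters, cost)
```

On this toy input the cost is 1 whichever pivot order is drawn: the path 1, 2, 3 always yields exactly one disagreement, while 10 and 11 are always clustered together and 20 stays alone. Computing the cost exactly, as done here, compares all pairs of stored nodes, which is the expense a sublinear-space estimator is meant to avoid.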