🤖 AI Summary
To address the computational-budget and timeliness constraints that hinder traditional batch entity resolution (ER) in large-scale data-stream scenarios, this paper proposes a progressive ER framework based on random sampling. Unlike existing approaches, which rely on deterministic ranking of candidate pairs and thus incur super-linear complexity, our method is the first to formulate priority assignment as a stochastic maximum bipartite matching problem, integrating probabilistic high-pass filtering and progressive update mechanisms to achieve strictly linear-time ER. This design sharply reduces initialization overhead and enables efficient processing of high-velocity data streams. Extensive experiments on eight real-world datasets demonstrate that our approach achieves a 3–6× speedup over state-of-the-art methods while maintaining comparable recall and precision.
📝 Abstract
Entity Resolution (ER) is a critical data-cleaning task that identifies records referring to the same real-world entity. In the era of Big Data, traditional batch ER is often infeasible due to volume and velocity constraints, necessitating progressive ER methods that maximize recall within a limited computational budget. However, existing progressive approaches fail to scale to high-velocity streams because they rely on deterministic sorting to prioritize candidate pairs, a process that incurs prohibitive super-linear complexity and heavy initialization costs. To break through this scalability wall, we introduce SPER (Stochastic Progressive ER), a novel framework that recasts prioritization as a sampling problem rather than a ranking problem. By replacing global sorting with a continuous stochastic bipartite maximization strategy, SPER acts as a probabilistic high-pass filter that selects high-utility pairs in strictly linear time. Extensive experiments on eight real-world datasets demonstrate that SPER achieves significant speedups (3× to 6×) over state-of-the-art baselines while maintaining comparable recall and precision.
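The "probabilistic high-pass filter" idea can be illustrated with a minimal sketch: rather than globally sorting all candidate pairs by utility (a super-linear operation), each pair is accepted with a probability that rises sharply with its estimated utility, in a single linear pass over the stream. This is an illustrative assumption of how such a filter might look; the function name `probabilistic_high_pass`, the sigmoid acceptance rule, and the `threshold`/`temperature` parameters are hypothetical and not taken from SPER itself.

```python
import math
import random

def probabilistic_high_pass(pairs, utility, threshold=0.5, temperature=0.1):
    """Select candidate pairs in one O(n) pass, no global sort.

    Each pair is accepted with a sigmoid probability centered at
    `threshold`; a small `temperature` makes the filter behave almost
    like a hard cutoff while remaining stochastic.
    (Hypothetical sketch -- not SPER's actual implementation.)
    """
    selected = []
    for pair in pairs:
        u = utility(pair)
        # Acceptance probability rises sharply once u exceeds the threshold.
        p = 1.0 / (1.0 + math.exp(-(u - threshold) / temperature))
        if random.random() < p:
            selected.append(pair)
    return selected
```

With `temperature=0.1`, a pair with utility 0.95 is accepted with probability ≈ 0.99, while one with utility 0.05 passes with probability ≈ 0.01, so high-utility pairs dominate the output without any ranking step.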