SPER: Accelerating Progressive Entity Resolution via Stochastic Bipartite Maximization

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the computational budget and timeliness constraints that make traditional batch entity resolution (ER) impractical on large-scale, high-velocity data streams, this paper proposes SPER, a progressive ER framework based on stochastic sampling. Unlike existing approaches, which rank candidate pairs deterministically at super-linear cost, SPER formulates prioritization as a stochastic bipartite maximization problem and acts as a probabilistic high-pass filter with progressive updates, selecting high-utility pairs in strictly linear time. This design reduces initialization overhead and enables efficient processing of high-velocity streams. Experiments on eight real-world datasets show a 3–6× speedup over state-of-the-art methods while maintaining comparable recall and precision.

📝 Abstract
Entity Resolution (ER) is a critical data cleaning task for identifying records that refer to the same real-world entity. In the era of Big Data, traditional batch ER is often infeasible due to volume and velocity constraints, necessitating Progressive ER methods that maximize recall within a limited computational budget. However, existing progressive approaches fail to scale to high-velocity streams because they rely on deterministic sorting to prioritize candidate pairs, a process that incurs prohibitive super-linear complexity and heavy initialization costs. To address this scalability wall, we introduce SPER (Stochastic Progressive ER), a novel framework that redefines prioritization as a sampling problem rather than a ranking problem. By replacing global sorting with a continuous stochastic bipartite maximization strategy, SPER acts as a probabilistic high-pass filter that selects high-utility pairs in strictly linear time. Extensive experiments on eight real-world datasets demonstrate that SPER achieves significant speedups (3x to 6x) over state-of-the-art baselines while maintaining comparable recall and precision.

Problem

Research questions and friction points this paper is trying to address.

SPER accelerates entity resolution for big data streams
It replaces deterministic sorting with stochastic bipartite maximization
The method reduces complexity to linear time while maintaining accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Stochastic bipartite maximization replaces deterministic sorting
Sampling-based prioritization enables linear-time pair selection
Probabilistic high-pass filter accelerates entity resolution scalability
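
The idea of replacing a global sort with sampling can be illustrated with a minimal sketch. This is not SPER's actual algorithm, which the abstract does not specify; it only shows, under stated assumptions, how a probabilistic high-pass filter can select mostly high-scoring candidate pairs in a single linear pass. The names `accept_probability` and `stochastic_filter`, the logistic acceptance rule, and the `threshold`, `sharpness`, and `budget` parameters are all hypothetical choices made for illustration, and scores are assumed to lie in [0, 1].

```python
import math
import random
from typing import Iterable, Iterator, Optional, Tuple

Pair = Tuple[str, str]


def accept_probability(score: float, threshold: float, sharpness: float) -> float:
    """Logistic acceptance rule (an assumption, not taken from the paper).

    Scores well above `threshold` are accepted with probability near 1,
    scores well below it with probability near 0.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (score - threshold)))


def stochastic_filter(
    scored_pairs: Iterable[Tuple[Pair, float]],
    threshold: float = 0.7,
    sharpness: float = 20.0,
    budget: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> Iterator[Tuple[Pair, float]]:
    """Emit candidate pairs in arrival order, keeping each one at random.

    Each pair is examined exactly once and nothing is sorted, so the cost is
    linear in the number of candidates. `budget` caps how many pairs are emitted.
    """
    rng = rng or random.Random()
    emitted = 0
    for pair, score in scored_pairs:
        if budget is not None and emitted >= budget:
            return
        if rng.random() < accept_probability(score, threshold, sharpness):
            emitted += 1
            yield pair, score


if __name__ == "__main__":
    # Synthetic candidates with uniformly random similarity scores.
    gen = random.Random(42)
    candidates = [((f"a{i}", f"b{i}"), gen.random()) for i in range(10_000)]
    kept = list(stochastic_filter(candidates, budget=100, rng=random.Random(1)))
    print(len(kept), "pairs selected without sorting")
```

Because pairs are emitted as they arrive, the filter needs no initialization phase over the full candidate set, which is the property a sort-based ranking lacks. The trade-off in this sketch is that a few low-scoring pairs may pass and a few high-scoring ones may be skipped.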