🤖 AI Summary
To address progressive entity resolution under time-sensitive and resource-constrained settings, this paper proposes a systematic framework for Progressive Entity Matching that decomposes the process into four stages: filtering, weighting, scheduling, and matching. It introduces a unified design space encompassing mainstream approaches and a novel confidence-prioritized dynamic scheduling mechanism that improves early recall and response efficiency. The framework integrates hybrid (rule- and learning-based) filtering, multi-feature weighting, and configurable matching functions, enabling end-to-end, on-demand retrieval of high-confidence results. Extensive experiments across 10 linkage and 8 deduplication datasets demonstrate that the optimal configuration achieves 90% recall of true duplicate pairs 42% earlier on average, while simultaneously improving both F1 score and time efficiency.
📝 Abstract
Entity Resolution (ER) is typically implemented as a batch task that processes all available data before identifying duplicate records. However, applications with time or computational constraints, e.g., those running in the cloud, require a progressive approach that produces results in a pay-as-you-go fashion. Numerous algorithms have been proposed for Progressive ER in the literature. In this work, we propose a novel framework for Progressive Entity Matching that organizes relevant techniques into four consecutive steps: (i) filtering, which reduces the search space to the most likely candidate matches, (ii) weighting, which associates every pair of candidate matches with a similarity score, (iii) scheduling, which prioritizes the execution of the candidate matches so that the real duplicates precede the non-matching pairs, and (iv) matching, which applies a complex matching function to the pairs in the order defined by the previous step. We associate each step with existing and novel techniques, illustrating that our framework generates a superset of the main existing works in the field. We select the most representative combinations resulting from our framework and fine-tune them over 10 established datasets for Record Linkage and 8 for Deduplication. Our results indicate that our taxonomy yields a wide range of high-performing progressive techniques, in terms of both effectiveness and time efficiency.
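The four steps can be illustrated with a minimal Python sketch for the Deduplication case. It is not the paper's implementation: token blocking for filtering, Jaccard similarity for weighting, a static descending sort for scheduling, and the `same_leading_tokens` matcher are all assumptions chosen for brevity, and every function name and the `budget` parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lower-cased whitespace tokens of a record (assumed representation)."""
    return set(text.lower().split())


def filter_candidates(records):
    """(i) Filtering: keep only pairs that share at least one token."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for tok in tokens(text):
            blocks[tok].add(rid)
    candidates = set()
    for ids in blocks.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates


def weight(records, pair):
    """(ii) Weighting: Jaccard similarity of the two token sets."""
    a, b = (tokens(records[rid]) for rid in pair)
    return len(a & b) / len(a | b)


def schedule(records, candidates):
    """(iii) Scheduling: most similar pairs first, ties broken by pair id."""
    return sorted(candidates, key=lambda p: (-weight(records, p), p))


def progressive_matches(records, match_fn, budget):
    """(iv) Matching: run match_fn in scheduled order, at most `budget` times."""
    ordered = schedule(records, filter_candidates(records))
    for left, right in ordered[:budget]:
        if match_fn(records[left], records[right]):
            yield (left, right)


def same_leading_tokens(a, b):
    """Hypothetical stand-in for an expensive matching function."""
    return a.lower().split()[:2] == b.lower().split()[:2]


records = {
    1: "Apple iPhone 13 128GB",
    2: "apple iphone 13 128gb black",
    3: "Samsung Galaxy S21",
    4: "samsung galaxy s21 5g",
    5: "Apple Watch",
}

# Only two comparisons are spent, yet both duplicate pairs are emitted.
for pair in progressive_matches(records, same_leading_tokens, budget=2):
    print(pair)
```

Because `progressive_matches` is a generator, a caller can stop consuming results at any point, which mirrors the pay-as-you-go behaviour described above. In this toy data the two true duplicate pairs are scheduled ahead of the two non-matching pairs, so the comparison budget is spent where it is most likely to pay off.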