Progressive Entity Matching: A Design Space Exploration

📅 2025-02-10
🏛️ Proceedings of the ACM on Management of Data
📈 Citations: 0
Influential: 0
🤖 AI Summary
To support entity resolution under time-sensitive and resource-constrained settings, this paper proposes the first systematic progressive entity matching framework, decomposing the process into four stages: filtering, weighting, scheduling, and matching. It introduces a unified design space encompassing mainstream approaches, along with a novel confidence-prioritized dynamic scheduling mechanism that significantly improves early recall and response efficiency. The framework integrates hybrid (rule- and learning-based) filtering, multi-feature weighting, and configurable matching functions, enabling end-to-end, on-demand retrieval of high-confidence results. Extensive experiments across 10 Record Linkage and 8 Deduplication datasets demonstrate that the best configuration reaches 90% recall of true duplicate pairs 42% earlier on average, while simultaneously improving both F1 score and time efficiency.

📝 Abstract
Entity Resolution (ER) is typically implemented as a batch task that processes all available data before identifying duplicate records. However, applications with time or computational constraints, e.g., those running in the cloud, require a progressive approach that produces results in a pay-as-you-go fashion. Numerous algorithms have been proposed for Progressive ER in the literature. In this work, we propose a novel framework for Progressive Entity Matching that organizes relevant techniques into four consecutive steps: (i) filtering, which reduces the search space to the most likely candidate matches, (ii) weighting, which associates every pair of candidate matches with a similarity score, (iii) scheduling, which prioritizes the execution of the candidate matches so that the real duplicates precede the non-matching pairs, and (iv) matching, which applies a complex matching function to the pairs in the order defined by the previous step. We associate each step with existing and novel techniques, illustrating that our framework overall generates a superset of the main existing works in the field. We select the most representative combinations resulting from our framework and fine-tune them over 10 established datasets for Record Linkage and 8 for Deduplication, with our results indicating that our taxonomy yields a wide range of high-performing progressive techniques both in terms of effectiveness and time efficiency.
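The four steps can be sketched as a minimal, self-contained pipeline. This is an illustration under simplifying assumptions, not the paper's implementation: token blocking stands in for filtering, Jaccard similarity for weighting, a max-heap for scheduling, and a plain similarity threshold for the configurable matching function.

```python
import heapq
from collections import defaultdict

def progressive_match(records, threshold=0.5):
    """Illustrative four-step progressive pipeline (a sketch, not the paper's code).

    records: dict mapping record id -> set of tokens.
    Yields (id_a, id_b, score) in descending similarity order, keeping only
    pairs whose score clears the stand-in matching threshold.
    """
    # (i) Filtering: token blocking - only records sharing a token are compared.
    blocks = defaultdict(list)
    for rid, tokens in records.items():
        for tok in tokens:
            blocks[tok].append(rid)
    candidates = set()
    for ids in blocks.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                candidates.add(tuple(sorted((ids[i], ids[j]))))

    # (ii) Weighting: Jaccard similarity over the records' token sets.
    def jaccard(a, b):
        return len(records[a] & records[b]) / len(records[a] | records[b])

    # (iii) Scheduling: a max-heap so the most promising pairs are tried first
    # (negated scores, since heapq implements a min-heap).
    heap = [(-jaccard(a, b), a, b) for a, b in candidates]
    heapq.heapify(heap)

    # (iv) Matching: a threshold test stands in for the complex, possibly
    # learned, matching function the framework allows to be plugged in here.
    while heap:
        neg_score, a, b = heapq.heappop(heap)
        if -neg_score >= threshold:
            yield a, b, -neg_score
```

Because the result is a generator ordered by score, a caller can consume only as many pairs as its time budget allows (the pay-as-you-go property), e.g. via `itertools.islice`.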
Problem

Research questions and friction points this paper is trying to address.

Batch ER must process all available data before reporting any duplicates, which is impractical under time or computational constraints (e.g., cloud deployments)
Numerous Progressive ER algorithms exist, but no unifying framework organizes their design choices for systematic comparison
Need for a broad evaluation spanning both Record Linkage and Deduplication workloads
Innovation

Methods, ideas, or system contributions that make the work stand out.

A four-step Progressive ER framework (filtering, weighting, scheduling, matching) whose design space subsumes the main existing approaches
A confidence-prioritized dynamic scheduling mechanism that surfaces likely duplicates first
Representative configurations fine-tuned over 18 datasets (10 Record Linkage, 8 Deduplication), improving both effectiveness and time efficiency