ApproxJoin: Approximate Matching for Efficient Verification in Fuzzy Set Similarity Join

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Exact matching algorithms, such as the Hungarian algorithm, make the verification phase of fuzzy set similarity joins computationally expensive. Method: This paper proposes a verification approach based on Approximate Maximum Weight Matching (AMWM). It evaluates three AMWM methods (Greedy, Locally Dominant, and Paz-Schwartzman) in the verification stage of the filter-verify framework, where they are used to prune candidate pairs. Contribution/Results: The approach reaches 99% recall while replacing exact bipartite matching with cheaper approximate matching. Experiments report performance improvements of 2-19x over the state-of-the-art approach that uses exact matching. By avoiding exact matching during verification, the approach offers a more scalable option for large fuzzy set similarity joins.

📝 Abstract
The set similarity join problem is a fundamental problem in data processing and discovery, relying on exact similarity measures between sets. In the presence of alterations, such as misspellings in string data, the fuzzy set similarity join problem instead approximately matches pairs of elements based on the maximum weight matching of the bipartite graph representation of the sets. State-of-the-art methods in this domain improve performance through efficient filtering within the filter-verify framework, primarily to offset the high verification cost of the Hungarian algorithm, an optimal matching method. Instead, we directly target the verification process to assess the efficacy of more efficient matching methods for candidate pair pruning. We present ApproxJoin, the first work of its kind to apply approximate maximum weight matching algorithms to the computationally expensive verification step of fuzzy set similarity joins. We comprehensively test the performance of three approximate matching methods (Greedy, Locally Dominant, and Paz-Schwartzman) and compare them with the state-of-the-art approach using exact matching. Our experimental results show that ApproxJoin yields performance improvements of 2-19x over the state of the art with high accuracy (99% recall).
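
The matching step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The q-gram Jaccard element similarity and the names `qgrams`, `elem_sim`, `greedy_matching`, and `exact_matching` are assumptions made for illustration, and a brute-force search stands in for the Hungarian algorithm so that only the Python standard library is needed.

```python
from itertools import permutations


def qgrams(s, q=2):
    """Character q-grams of s (hypothetical tokenizer; short strings map to themselves)."""
    s = s.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)} or {s}


def elem_sim(a, b):
    """Jaccard similarity of the q-gram sets of two strings (assumed edge weight)."""
    ga, gb = qgrams(a), qgrams(b)
    return len(ga & gb) / len(ga | gb)


def greedy_matching(R, S):
    """Greedy approximate matching: scan edges by descending weight and
    keep an edge whenever both of its endpoints are still unmatched."""
    edges = sorted(
        ((elem_sim(r, s), i, j) for i, r in enumerate(R) for j, s in enumerate(S)),
        reverse=True,
    )
    used_r, used_s, total = set(), set(), 0.0
    for w, i, j in edges:
        if w > 0 and i not in used_r and j not in used_s:
            used_r.add(i)
            used_s.add(j)
            total += w
    return total


def exact_matching(R, S):
    """Optimal matching weight by brute force (tiny inputs only).
    Stands in for the Hungarian algorithm in this sketch."""
    if len(R) > len(S):
        R, S = S, R
    return max(
        sum(elem_sim(r, S[j]) for r, j in zip(R, perm))
        for perm in permutations(range(len(S)), len(R))
    )


R = ["apple", "banana", "cherry"]
S = ["aple", "bananna", "chery", "date"]
approx = greedy_matching(R, S)
optimal = exact_matching(R, S)
print(f"greedy={approx:.3f} optimal={optimal:.3f}")
```

The greedy weight never exceeds the optimal weight, and for maximum weight matching it is known to be at least half of it. That bound is what makes greedy matching a candidate replacement for the exact step.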
Problem

Research questions and friction points this paper is trying to address.

Efficient verification for fuzzy set similarity joins
Reducing high verification costs with approximate matching
Improving performance in set similarity join processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses approximate matching for fuzzy joins
Compares three efficient matching methods
Achieves high accuracy with 99% recall
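
A rough sketch of how an approximate matching could drive verification is given below. It assumes a fuzzy Jaccard score of the form M / (|R| + |S| - M), where M is the matching weight. That formula, the precomputed weight matrix, and the names `greedy_weight`, `fuzzy_jaccard`, and `verify` are illustrative assumptions rather than details taken from the paper.

```python
def greedy_weight(weights):
    """Greedy matching weight for a dense weight matrix given as a list of rows."""
    edges = sorted(
        ((w, i, j) for i, row in enumerate(weights) for j, w in enumerate(row) if w > 0),
        reverse=True,
    )
    used_rows, used_cols, total = set(), set(), 0.0
    for w, i, j in edges:
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            total += w
    return total


def fuzzy_jaccard(match_weight, n_r, n_s):
    """Assumed fuzzy Jaccard score: M / (|R| + |S| - M)."""
    return match_weight / (n_r + n_s - match_weight)


def verify(weights, threshold):
    """Accept a candidate pair if the greedy score reaches the threshold."""
    n_r = len(weights)
    n_s = len(weights[0]) if weights else 0
    if n_r == 0 or n_s == 0:
        return False
    return fuzzy_jaccard(greedy_weight(weights), n_r, n_s) >= threshold


# Greedy takes the 0.9 edge first and blocks both 0.8 edges,
# so it reports 0.9 although the optimal matching weighs 1.6.
candidate = [[0.9, 0.8],
             [0.8, 0.0]]
print(greedy_weight(candidate), verify(candidate, 0.5))
```

Because an approximate matching never weighs more than the optimal one, a score computed this way never overestimates the exact score. Under these assumptions an accepted pair is always a true match, and the only possible errors are missed pairs such as `candidate` above, which is consistent with the results being reported in terms of recall.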