🤖 AI Summary
Existing methods for approximate duplicate detection over weighted subsequences (e.g., TF-IDF-weighted tokens) in large corpora lack theoretical accuracy guarantees, are sensitive to hard-to-tune parameters, and fail to account for token importance. This paper proposes MONO, the first efficient subsequence retrieval framework supporting weighted Jaccard similarity. Its core innovation integrates consistent weighted sampling into the min-hash structure, achieving theoretically optimal grouping complexity, with a matching lower bound proving the complexity tight. Extensive experiments demonstrate that MONO achieves high precision while accelerating index construction by up to 26×, reducing index size by up to 30%, and cutting query latency by up to 3×, and it scales well across diverse workloads and data sizes.
📝 Abstract
Near-duplicate text alignment is the task of identifying, among the texts in a corpus, all the subsequences (substrings) that are similar to a given query. Traditional approaches rely on seeding-extension-filtering heuristics, which lack accuracy guarantees and require many hard-to-tune parameters. Recent methods leverage min-hash techniques under a hash-based framework: group subsequences by their min-hash, and for any query, find all sketches similar to the query's sketch. These methods are efficient and guaranteed to report all subsequences whose estimated unweighted Jaccard similarity with the query exceeds a user-provided threshold. However, they fail to account for token importance or frequency, which limits their use in real scenarios where tokens carry weights, such as TF-IDF. To address this, we propose MONO, an approach that supports weighted Jaccard similarity using consistent weighted sampling. MONO achieves optimality within the hash-based framework. For example, when token weights are proportional to frequencies, MONO generates O(n + n log f) groups in expectation for a text of length n, where f is the maximum token frequency. Each group takes O(1) space and represents a few subsequences sharing the same sample. We prove this bound is tight: any algorithm must produce Ω(n + n log f) groups in expectation in the worst case. Experiments show that MONO outperforms the state of the art by up to 26× in index construction time, reduces index size by up to 30%, and improves query latency by up to 3×, while scaling well.
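To make the core primitives concrete, here is a minimal sketch (not MONO itself, and not its grouping scheme) of the two ingredients the abstract names: exact weighted Jaccard similarity over token-weight maps, and consistent weighted sampling in the style of Ioffe's ICWS, whose collision probability between two weighted sets equals their weighted Jaccard similarity. All function names and the string-based per-token seeding are illustrative assumptions.

```python
import math
import random

def weighted_jaccard(a, b):
    """Exact weighted Jaccard: sum of per-token min weights over sum of max weights."""
    tokens = set(a) | set(b)
    num = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in tokens)
    den = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in tokens)
    return num / den if den else 0.0

def icws_sample(weights, seed):
    """One consistent weighted sample (Ioffe-style ICWS) of a {token: weight} dict.

    Returns a pair (token, t). Two inputs yield the same pair with probability
    equal to their weighted Jaccard similarity, so collision rate estimates it.
    """
    best, best_a = None, math.inf
    for token, w in weights.items():
        if w <= 0:
            continue
        # Per-token randomness, deterministic in (seed, token) so it is
        # *consistent* across different input sets (illustrative seeding).
        rng = random.Random(f"{seed}:{token}")
        r = rng.gammavariate(2.0, 1.0)
        c = rng.gammavariate(2.0, 1.0)
        beta = rng.random()
        t = math.floor(math.log(w) / r + beta)
        y = math.exp(r * (t - beta))
        a_k = c / (y * math.exp(r))
        if a_k < best_a:
            best_a, best = a_k, (token, t)
    return best

def estimate_similarity(a, b, n_samples=300):
    """Estimate weighted Jaccard as the collision rate of CWS samples."""
    hits = sum(icws_sample(a, s) == icws_sample(b, s) for s in range(n_samples))
    return hits / n_samples
```

For example, with `a = {"x": 2, "y": 1}` and `b = {"x": 1, "y": 1, "z": 1}`, the exact similarity is (1 + 1) / (2 + 1 + 1) = 0.5, and the sample-collision estimate converges to that value as `n_samples` grows. MONO's contribution is not this estimator but an index that groups subsequences sharing the same sample in O(1) space per group.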