🤖 AI Summary
This study addresses the challenge of identifying *semantic duplication*, as distinct from literal repetition, in economics literature titles. Methodologically, it introduces the first integrated framework combining sBERT-based semantic embeddings with traditional similarity metrics (e.g., Levenshtein distance and cosine similarity), augmented by NLP preprocessing and LLM-assisted verification; a manually annotated ground-truth dataset supports rigorous evaluation. Empirical results indicate that semantic duplication is relatively rare in economics titles and that sBERT substantially outperforms string-matching baselines, showing strong robustness and high agreement with human judgments (Cohen’s κ > 0.85). The work thus establishes the first domain-specific, semantically grounded evaluation framework for duplicate detection in economics titles, accompanied by a fully reproducible technical pipeline, advancing methodological rigor in scholarly metadata curation and bibliometric analysis.
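The core scoring step the summary describes pairs semantic similarity from sBERT embeddings with surface-level string similarity. A minimal sketch of that hybrid comparison is shown below; the specific libraries (`sentence-transformers`, `rapidfuzz`) and the model name are illustrative assumptions, not details confirmed by the paper.

```python
# Sketch of a hybrid title-similarity score: sBERT cosine similarity
# combined with a normalized Levenshtein ratio. Library choices
# (sentence-transformers, rapidfuzz) and the model name are assumptions.
from sentence_transformers import SentenceTransformer, util
from rapidfuzz.distance import Levenshtein

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed sBERT variant

def title_similarity(a: str, b: str) -> dict:
    # Semantic similarity: cosine over sBERT sentence embeddings.
    emb = model.encode([a, b], convert_to_tensor=True)
    semantic = util.cos_sim(emb[0], emb[1]).item()
    # Surface similarity: Levenshtein distance normalized to [0, 1].
    lexical = Levenshtein.normalized_similarity(a, b)
    return {"sbert_cosine": semantic, "levenshtein_ratio": lexical}

# Two titles that are lexically different but semantically near-duplicates.
print(title_similarity(
    "Monetary policy and inflation expectations",
    "Inflation expectations and monetary policy",
))
```

Pairs like the one above are exactly where string metrics under-score and embedding-based similarity is expected to recover the duplicate.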
📝 Abstract
This study investigates efficient deduplication techniques for a large NLP dataset of economics research paper titles. We explore various pairing methods alongside established distance measures (Levenshtein distance, cosine similarity) and an sBERT model for semantic evaluation. Our findings suggest a low prevalence of duplicates, based on the semantic similarity observed across the different methods. A further evaluation against a human-annotated ground-truth set provides a more conclusive assessment, and its results corroborate the findings of the NLP- and LLM-based distance metrics.
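As a hedged illustration of the ground-truth comparison the abstract mentions, the sketch below thresholds similarity scores into duplicate/non-duplicate labels and measures agreement with human annotations via Cohen's κ. The threshold, the toy data, and the use of `scikit-learn` are assumptions for illustration, not the paper's actual protocol.

```python
# Sketch of agreement evaluation against a human-annotated ground truth:
# binarize a similarity score and compare with Cohen's kappa.
# Threshold value and data are illustrative only.
from sklearn.metrics import cohen_kappa_score

human_labels = [1, 0, 0, 1, 0, 1]              # annotator: 1 = duplicate pair
scores = [0.91, 0.32, 0.44, 0.88, 0.15, 0.79]  # e.g., sBERT cosine scores

threshold = 0.75  # assumed decision threshold
predicted = [int(s >= threshold) for s in scores]

kappa = cohen_kappa_score(human_labels, predicted)
print(f"Cohen's kappa vs. human annotation: {kappa:.2f}")
```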