Evaluating Deduplication Techniques for Economic Research Paper Titles with a Focus on Semantic Similarity using NLP and LLMs

📅 2024-10-02
🏛️ arXiv.org
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This study addresses the challenge of identifying *semantic duplication*—distinct from literal repetition—in economics literature titles. Methodologically, it introduces the first integrated framework combining sBERT-based semantic embeddings with traditional similarity metrics (e.g., Levenshtein distance and cosine similarity), augmented by NLP preprocessing and LLM-assisted verification; a manually annotated ground-truth dataset ensures rigorous evaluation. Empirical results reveal that semantic duplication is relatively rare in economics titles; sBERT substantially outperforms string-matching baselines, demonstrating strong robustness and high agreement with human judgments (Cohen’s κ > 0.85). The work establishes the first domain-specific, semantically grounded evaluation framework for duplicate detection in economics titles, accompanied by a fully reproducible technical pipeline—thereby advancing methodological rigor in scholarly metadata curation and bibliometric analysis.

📝 Abstract
This study investigates efficient deduplication techniques for a large NLP dataset of economic research paper titles. We explore various pairing methods alongside established distance measures (Levenshtein distance, cosine similarity) and an sBERT model for semantic evaluation. Our findings suggest a potentially low prevalence of duplicates, based on the semantic similarity observed across the different methods. A further evaluation against a human-annotated ground-truth set provides a more conclusive assessment, and its result supports the findings from the NLP- and LLM-based distance metrics.
Problem

Research questions and friction points this paper is trying to address.

Evaluate deduplication techniques for economic paper titles
Compare semantic similarity using NLP and LLM methods
Assess duplicate prevalence with human-annotated validation
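
One common way to quantify agreement between a method's duplicate/non-duplicate decisions and human annotations is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is illustrative only: the function name `cohens_kappa` and the example labels are assumptions, and the abstract does not state which agreement statistic the authors computed.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: 1 = pair judged duplicate, 0 = not duplicate.
human = [1, 1, 0, 0]
model = [1, 0, 0, 0]
kappa = cohens_kappa(human, model)
```

Because duplicates are expected to be rare, raw accuracy can look high even for a method that never flags a pair; a chance-corrected statistic is less prone to that distortion.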
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses NLP and LLMs for semantic similarity
Applies Levenshtein and cosine distance measures
Incorporates sBERT model for semantic evaluation
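
The two string-level measures named above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' pipeline: the function names, the whitespace tokenisation, the length normalisation, and the example titles are all assumptions. The sBERT step is omitted because it needs an external model; with sentence embeddings in hand, the same cosine formula would apply to the embedding vectors instead of word counts.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of lowercase, whitespace-tokenised word-count vectors."""
    counts_a = Counter(a.lower().split())
    counts_b = Counter(b.lower().split())
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm = sqrt(sum(c * c for c in counts_a.values())) * sqrt(
        sum(c * c for c in counts_b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical titles, for illustration only.
title_1 = "Trade and Growth in Emerging Markets"
title_2 = "Growth and Trade in Emerging Markets"
edit_sim = levenshtein_similarity(title_1, title_2)
word_sim = cosine_similarity(title_1, title_2)
```

The pair above shows why the measures can disagree: reordering words lowers the character-level score while leaving the bag-of-words cosine at its maximum, and neither captures paraphrases that share no vocabulary, which is the gap a semantic model such as sBERT is meant to cover.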
Doohee You
The World Bank
Samuel P. Fraiberger
The World Bank, New York University