🤖 AI Summary
This study addresses the challenge of identifying *semantic duplication*, as distinct from literal repetition, in economics literature titles. Methodologically, it introduces the first integrated framework combining sBERT-based semantic embeddings with traditional similarity metrics (e.g., Levenshtein distance and cosine similarity), augmented by NLP preprocessing and LLM-assisted verification; a manually annotated ground-truth dataset supports rigorous evaluation. Empirical results indicate that semantic duplication is relatively rare in economics titles and that sBERT substantially outperforms string-matching baselines, showing strong robustness and high agreement with human judgments (Cohen’s κ > 0.85). The work thus establishes the first domain-specific, semantically grounded evaluation framework for duplicate detection in economics titles, accompanied by a fully reproducible technical pipeline, advancing methodological rigor in scholarly metadata curation and bibliometric analysis.
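The core scoring step the summary describes pairs semantic similarity from sBERT embeddings with surface-level string similarity. A minimal sketch of that hybrid comparison is shown below; the specific libraries (`sentence-transformers`, `rapidfuzz`) and the model name are illustrative assumptions, not details confirmed by the paper.

```python
# Sketch of a hybrid title-similarity score: sBERT cosine similarity
# combined with a normalized Levenshtein ratio. Library choices
# (sentence-transformers, rapidfuzz) and the model name are assumptions.
from sentence_transformers import SentenceTransformer, util
from rapidfuzz.distance import Levenshtein

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed sBERT variant

def title_similarity(a: str, b: str) -> dict:
    # Semantic similarity: cosine over sBERT sentence embeddings.
    emb = model.encode([a, b], convert_to_tensor=True)
    semantic = util.cos_sim(emb[0], emb[1]).item()
    # Surface similarity: Levenshtein distance normalized to [0, 1].
    lexical = Levenshtein.normalized_similarity(a, b)
    return {"sbert_cosine": semantic, "levenshtein_ratio": lexical}

# Two titles that are lexically different but semantically near-duplicates.
print(title_similarity(
    "Monetary policy and inflation expectations",
    "Inflation expectations and monetary policy",
))
```

Pairs like the one above are exactly where string metrics under-score and embedding-based similarity is expected to recover the duplicate.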
📝 Abstract
This study investigates efficient deduplication techniques for a large NLP dataset of economics research paper titles. We explore various pairing methods alongside established distance measures (Levenshtein distance, cosine similarity) and an sBERT model for semantic evaluation. Our findings suggest a low prevalence of duplicates, based on the semantic similarity observed across the different methods. A further evaluation against a human-annotated ground-truth set provides a more conclusive assessment, and its results corroborate the findings of the NLP- and LLM-based distance metrics.
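As a hedged illustration of the ground-truth comparison the abstract mentions, the sketch below thresholds similarity scores into duplicate/non-duplicate labels and measures agreement with human annotations via Cohen's κ. The threshold, the toy data, and the use of `scikit-learn` are assumptions for illustration, not the paper's actual protocol.

```python
# Sketch of agreement evaluation against a human-annotated ground truth:
# binarize a similarity score and compare with Cohen's kappa.
# Threshold value and data are illustrative only.
from sklearn.metrics import cohen_kappa_score

human_labels = [1, 0, 0, 1, 0, 1]              # annotator: 1 = duplicate pair
scores = [0.91, 0.32, 0.44, 0.88, 0.15, 0.79]  # e.g., sBERT cosine scores

threshold = 0.75  # assumed decision threshold
predicted = [int(s >= threshold) for s in scores]

kappa = cohen_kappa_score(human_labels, predicted)
print(f"Cohen's kappa vs. human annotation: {kappa:.2f}")
```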