AI Summary
Prior research insufficiently addresses the dual role of large language models (LLMs) in academic integrity: both facilitating and detecting plagiarism. Method: We introduce PlagBench, the first benchmark dataset for LLM-era plagiarism research, comprising 46.5K synthetically generated text pairs covering verbatim copying, paraphrasing, and summarization. Texts are produced by models including GPT-3.5 Turbo and GPT-4 Turbo and rigorously validated via automated metrics and human annotation. Contribution/Results: We conduct the first systematic evaluation of five open and proprietary LLMs alongside three commercial plagiarism detection tools on both plagiarism generation and detection tasks. Results show that GPT-3.5 Turbo generates high-quality paraphrased and summarized plagiarized content without significantly increasing lexical or syntactic complexity, while GPT-4 achieves a 20% average improvement in detection accuracy over other models and commercial tools. This work establishes foundational resources and empirical insights for modeling LLM-driven plagiarism behavior and evaluating detection efficacy.
Abstract
Recent studies have raised concerns about the potential threats large language models (LLMs) pose to academic integrity and copyright protection. Yet these investigations have predominantly focused on literal copies of original texts, and how LLMs can facilitate the detection of LLM-generated plagiarism remains largely unexplored. To address these gaps, we introduce PlagBench, a dataset of 46.5K synthetic text pairs that represent three major types of plagiarism: verbatim copying, paraphrasing, and summarization. These samples are generated by three advanced LLMs. We rigorously validate the quality of PlagBench through a combination of fine-grained automatic evaluation and human annotation. We then utilize this dataset for two purposes: (1) to examine LLMs' ability to transform original content into accurate paraphrases and summaries, and (2) to evaluate the plagiarism detection performance of five modern LLMs alongside three specialized plagiarism checkers. Our results show that GPT-3.5 Turbo can produce high-quality paraphrases and summaries without significantly increasing text complexity, compared to GPT-4 Turbo. In detection, however, GPT-4 outperforms the other LLMs and commercial detection tools by 20% on average, highlighting the evolving capabilities of LLMs not only in content generation but also in plagiarism detection. Data and source code are available at https://github.com/Brit7777/plagbench.