AI Summary
Existing methods for detecting LLM-generated textual plagiarism generalize poorly across temporal domains, particularly when source texts predate the training data of modern LLMs. Method: We construct a large-scale generative plagiarism benchmark covering multiple LLM generations (Llama, DeepSeek-R1, Mistral) and propose a semantic embedding similarity-based alignment detection framework, rigorously evaluating diverse baseline models. Contribution/Results: The best-performing method achieves 80% recall and 50% precision on mainstream test sets; however, performance degrades significantly (by over 40 percentage points in recall) on pre-2015 scientific literature. This reveals a critical temporal generalization bottleneck in current LLM plagiarism detectors, providing the first empirical evidence of their failure under time-shifted conditions. Our benchmark and analysis establish a foundational resource for developing robust, temporally adaptive academic integrity tools capable of detecting plagiarism across evolving knowledge timelines.
Abstract
The generative plagiarism detection task at PAN 2025 aims at identifying automatically generated textual plagiarism in scientific articles and aligning the plagiarized passages with their respective sources. We created a novel large-scale dataset of automatically generated plagiarism using three large language models: Llama, DeepSeek-R1, and Mistral. In this task overview paper, we outline the creation of this dataset, summarize and compare the results of all participants and four baselines, and evaluate the submitted approaches on the dataset of the last plagiarism detection task from PAN 2015 in order to assess their robustness. We found that the current iteration did not elicit a large variety of approaches, as naive semantic similarity methods based on embedding vectors already provide promising results of up to 0.8 recall and 0.5 precision. In contrast, most of these approaches underperform significantly on the 2015 dataset, indicating a lack of generalizability.
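The embedding-similarity alignment idea mentioned above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' actual system: it replaces a real sentence-embedding model with a toy hashed bag-of-words vector, and the `align` function, its `threshold` parameter, and the vector dimension are assumptions chosen for the example. Real approaches would embed sentences with a pretrained encoder and compare suspicious passages against candidate sources the same way.

```python
import math
import re


def embed(sentence: str, dim: int = 64) -> list[float]:
    # Toy stand-in for a sentence-embedding model: a hashed
    # bag-of-words vector. A real system would use a pretrained
    # transformer encoder here instead.
    vec = [0.0] * dim
    for tok in re.findall(r"\w+", sentence.lower()):
        vec[hash(tok) % dim] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def align(suspicious: list[str], sources: list[str],
          threshold: float = 0.7) -> list[tuple[str, str, float]]:
    # Pair each suspicious sentence with its most similar source
    # sentence, keeping only pairs whose similarity clears the
    # threshold (a hypothetical cut-off for this sketch).
    src_vecs = [(s, embed(s)) for s in sources]
    pairs = []
    for susp in suspicious:
        v = embed(susp)
        best, best_sim = None, 0.0
        for src, sv in src_vecs:
            sim = cosine(v, sv)
            if sim > best_sim:
                best, best_sim = src, sim
        if best is not None and best_sim >= threshold:
            pairs.append((susp, best, best_sim))
    return pairs
```

A sentence copied verbatim aligns with similarity 1.0, while a sentence sharing no vocabulary with any source falls below the threshold and is left unaligned; paraphrased plagiarism sits in between, which is exactly where the choice of encoder and threshold determines recall and precision.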