Generative AI in Academic Writing: A Comparison of DeepSeek, Qwen, ChatGPT, Gemini, Llama, Mistral, and Gemma

📅 2025-02-11
🏛️ arXiv.org
📈 Citations: 1 (influential: 0)
🤖 AI Summary
This study systematically evaluates seven prominent large language models—DeepSeek v3, Qwen 2.5/3, ChatGPT, Gemini, Llama, Mistral, and Gemma—on academic writing performance, focusing on originality, detectability, and readability. Using 40 peer-reviewed papers from digital twin and healthcare domains, we generate text via two paradigms: question-based generation and abstract rewriting. Multidimensional quantitative analysis integrates Turnitin (plagiarism detection), ZeroGPT/GLTR (AI-text classification), BERTScore (semantic similarity), Flesch-Kincaid Grade Level (readability), and lexical frequency metrics. Our work presents the first horizontal benchmark comparing emerging Chinese open-source models (DeepSeek v3, Qwen 3) against leading international counterparts. Results reveal that the rewriting paradigm substantially increases plagiarism risk—100% of outputs exceed Turnitin’s similarity threshold and are uniformly flagged as AI-generated—while also failing to meet standard academic readability requirements. Collectively, findings expose critical bottlenecks in current LLMs for scholarly writing: insufficient originality, high AI detectability, and suboptimal linguistic clarity.
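Among the metrics listed above, the Flesch-Kincaid Grade Level is a simple closed-form readability score: FKGL = 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59. A minimal stdlib-only sketch of that formula is shown below; the syllable counter is a common vowel-group heuristic (an assumption here, not the paper's implementation — the authors do not specify their tooling, and libraries such as textstat use more refined rules):

```python
import re

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    Higher scores mean harder text (roughly the US school grade needed)."""
    # Split sentences on terminal punctuation; drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

    def count_syllables(word: str) -> int:
        # Heuristic: count runs of vowels, subtract a trailing silent 'e'.
        word = word.lower()
        groups = re.findall(r"[aeiouy]+", word)
        n = len(groups)
        if word.endswith("e") and n > 1 and not word.endswith(("le", "ee")):
            n -= 1
        return max(n, 1)

    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)
```

Under this scoring, LLM output dense with long multi-syllable words (as the study reports) lands well above the readability range typically recommended for broad accessibility.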

📝 Abstract
DeepSeek v3, developed in China, was released in December 2024, followed by Alibaba's Qwen 2.5 Max in January 2025 and Qwen3 235B in April 2025. These free and open-source models offer significant potential for academic writing and content creation. This study evaluates their academic writing performance by comparing them with ChatGPT, Gemini, Llama, Mistral, and Gemma. There is a critical gap in the literature concerning how extensively these tools can be used and whether they can generate content that is original, readable, and effective. Using 40 papers on Digital Twin and Healthcare, texts were generated with AI tools in two ways: answering posed questions and paraphrasing abstracts. The generated content was analyzed using plagiarism detection, AI detection, word count comparisons, semantic similarity, and readability assessments. Results indicate that paraphrased abstracts showed higher plagiarism rates, and question-based responses also exceeded acceptable levels. AI detection tools consistently identified all outputs as AI-generated. Word count analysis revealed that all chatbots produced a sufficient volume of content. Semantic similarity tests showed strong overlap between generated and original texts. However, readability assessments indicated that the texts fell short in clarity and accessibility. This study comparatively highlights the potential and limitations of popular and recent large language models for academic writing. While these models generate substantial and semantically accurate content, concerns regarding plagiarism, AI detection, and readability must be addressed before they can be used effectively in scholarly work.
Problem

Research questions and friction points this paper is trying to address.

Evaluates academic writing performance of multiple AI models
Assesses plagiarism, readability, and semantic similarity in AI-generated content
Identifies limitations in AI tools for scholarly work
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comparative analysis of multiple AI models
Evaluation using plagiarism and AI detection
Readability and semantic similarity assessments