A Comparison of DeepSeek and Other LLMs

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates DeepSeek's performance against Claude, Gemini, GPT, and Llama on authorship attribution and citation type classification. We propose an LLM-driven, MADStat-constrained controllable data synthesis paradigm to construct a high-quality, manually annotated benchmark dataset for academic text classification. Conducting the first multi-dimensional comparative evaluation, we assess the models under supervised fine-tuning, measure semantic fidelity via BERTScore and cosine similarity, and analyze cost–latency–accuracy trade-offs. Key contributions include: (1) releasing the first open-source, fine-grained academic text classification benchmark with human annotations; (2) establishing a reproducible, controllable methodology for LLM-based synthetic data generation; and (3) empirical findings showing that DeepSeek achieves higher accuracy than Gemini, GPT, and Llama, though marginally below Claude, while incurring the lowest inference cost but higher latency, and producing outputs semantically closest to those of Claude and Gemini.

📝 Abstract
Recently, DeepSeek has been the focus of attention in and beyond the AI community. An interesting problem is how DeepSeek compares to other large language models (LLMs). There are many tasks an LLM can do, and in this paper, we use the task of predicting an outcome from a short text for comparison. We consider two settings: an authorship classification setting and a citation classification setting. In the first, the goal is to determine whether a short text was written by a human or an AI. In the second, the goal is to classify a citation into one of four types based on its textual content. For each experiment, we compare DeepSeek with 4 popular LLMs: Claude, Gemini, GPT, and Llama. We find that, in terms of classification accuracy, DeepSeek outperforms Gemini, GPT, and Llama in most cases, but underperforms Claude. We also find that DeepSeek is comparatively slower than the others but has a low usage cost, while Claude is much more expensive than all the others. Finally, we find that, in terms of similarity, the output of DeepSeek is most similar to those of Gemini and Claude (and among all 5 LLMs, Claude and Gemini have the most similar outputs). In this paper, we also present a fully labeled dataset that we collected ourselves, and propose a recipe in which the LLMs and a recent dataset, MADStat, can be used to generate new datasets. The datasets in our paper can serve as benchmarks for future studies of LLMs.
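The abstract's similarity comparison (finding that DeepSeek's outputs are closest to those of Gemini and Claude) relies on measuring how semantically close two models' outputs are. A minimal sketch of the standard cosine-similarity step is below; the embedding vectors here are hypothetical placeholders, since the paper would embed actual model outputs with a sentence encoder before comparing them.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings of two models' outputs for the same prompt.
emb_deepseek = np.array([0.20, 0.70, 0.10])
emb_claude = np.array([0.25, 0.65, 0.05])

score = cosine_similarity(emb_deepseek, emb_claude)
print(f"similarity: {score:.3f}")
```

Averaging such pairwise scores over a test set yields a single similarity number per model pair, which is one way the "most similar outputs" ranking in the abstract could be produced.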
Problem

Research questions and friction points this paper is trying to address.

Compare DeepSeek with other LLMs
Evaluate LLMs on classification tasks
Propose new datasets for LLM benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compares DeepSeek with four LLMs
Uses authorship and citation classification
Proposes new LLM-generated datasets recipe