LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs

📅 2025-04-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing code-generation research lacks high-quality, reasoning-oriented evaluation benchmarks and integrated training–evaluation platforms. Method: We introduce the first timestamped (July 2024) LeetCode Python dataset, featuring rich metadata and 100+ test cases per problem, enabling contamination-free evaluation and efficient supervised fine-tuning (SFT). We propose a timestamp-based contamination-isolation protocol, and show that SFT on only 2.6K model-generated solutions matches the performance of training on 110K human-written solutions. We further open-source end-to-end tooling for data cleaning, time-aware dataset splitting, SFT, and reasoning-focused evaluation. Contribution/Results: Our reasoning-enhanced model achieves a +12.3% absolute improvement on HumanEval and MBPP. The dataset and framework are publicly released on Hugging Face and GitHub.

📝 Abstract
We introduce LeetCodeDataset, a high-quality benchmark for evaluating and training code-generation models, addressing two key challenges in LLM research: the lack of reasoning-focused coding benchmarks and self-contained training testbeds. By curating LeetCode Python problems with rich metadata, broad coverage, 100+ test cases per problem, and temporal splits (pre/post July 2024), our dataset enables contamination-free evaluation and efficient supervised fine-tuning (SFT). Experiments show reasoning models significantly outperform non-reasoning counterparts, while SFT with only 2.6K model-generated solutions achieves performance comparable to 110K-sample counterparts. The dataset and evaluation framework are available on Hugging Face and GitHub.
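The temporal split described above can be sketched in a few lines: problems released before the cutoff form the training split, and later problems form a contamination-free test split. The field names and records below are illustrative assumptions, not the dataset's actual schema.

```python
from datetime import date

# Hypothetical problem records; "released" stands in for whatever
# release-date metadata the dataset actually carries.
problems = [
    {"id": "two-sum", "released": date(2015, 8, 10)},
    {"id": "new-problem", "released": date(2024, 9, 1)},
]

CUTOFF = date(2024, 7, 1)  # the paper's pre/post July 2024 split point

# Problems released before the cutoff may appear in model training data,
# so they go to the training split; later problems form the test split.
train_split = [p for p in problems if p["released"] < CUTOFF]
test_split = [p for p in problems if p["released"] >= CUTOFF]
```

Because any model trained before the cutoff cannot have seen the post-cutoff problems, evaluating only on `test_split` isolates contamination by construction.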
Problem

Research questions and friction points this paper is trying to address.

Lack of reasoning-focused coding benchmarks for LLMs
Need for self-contained training testbeds in code-generation models
Challenges in contamination-free evaluation and efficient fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

LeetCodeDataset with rich metadata
Temporal splits for contamination-free evaluation
Efficient SFT with model-generated solutions
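The 100+ test cases per problem support an all-tests-pass acceptance criterion for generated solutions. A minimal sketch of that check, with a toy candidate function and illustrative cases (none of the names below come from the paper's tooling):

```python
def passes_all(candidate, test_cases):
    """Return True iff the candidate's output matches the expected
    output on every (input, expected) pair."""
    return all(candidate(*inp) == expected for inp, expected in test_cases)

# Toy stand-in for a generated solution.
def add(a, b):
    return a + b

# Illustrative test cases; the dataset supplies 100+ per problem.
cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(passes_all(add, cases))  # True
```

A solution counts as correct only if it passes every case, which is what makes the large per-problem test suites useful for both evaluation and for filtering model-generated SFT data.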
Authors: Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, Xiaolong Xu