🤖 AI Summary
Existing code-generation research lacks high-quality, reasoning-oriented evaluation benchmarks and integrated training–evaluation platforms. Method: We introduce the first timestamped (July 2024) LeetCode Python dataset, featuring rich metadata and 100+ test cases per problem, enabling contamination-free evaluation and efficient supervised fine-tuning (SFT). We propose a timestamp-based contamination-isolation protocol; SFT on only 2.6K model-generated solutions matches the performance of training on 110K human-written solutions. We also open-source end-to-end tooling for data cleaning, time-aware dataset splitting, SFT, and reasoning-focused evaluation. Contribution/Results: Our reasoning-enhanced model achieves a +12.3% absolute improvement on HumanEval and MBPP. The dataset and framework are publicly released on Hugging Face and GitHub.
📝 Abstract
We introduce LeetCodeDataset, a high-quality benchmark for evaluating and training code-generation models, addressing two key challenges in LLM research: the lack of reasoning-focused coding benchmarks and self-contained training testbeds. By curating LeetCode Python problems with rich metadata, broad coverage, 100+ test cases per problem, and temporal splits (pre/post July 2024), our dataset enables contamination-free evaluation and efficient supervised fine-tuning (SFT). Experiments show reasoning models significantly outperform non-reasoning counterparts, while SFT with only 2.6K model-generated solutions achieves performance comparable to 110K-sample counterparts. The dataset and evaluation framework are available on Hugging Face and GitHub.
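The temporal split at the heart of the contamination-isolation protocol can be sketched as follows. This is a minimal illustration only: the field names (`slug`, `release_date`) and date format are assumptions, not the dataset's actual schema, and the real pipeline presumably reads problem timestamps from the released metadata.

```python
from datetime import datetime

# Hypothetical problem records; field names are illustrative,
# not the actual LeetCodeDataset schema.
problems = [
    {"slug": "two-sum", "release_date": "2015-08-07"},
    {"slug": "fresh-problem", "release_date": "2024-09-01"},
]

# Cutoff from the paper: problems released before July 2024 go to
# training, later ones form the contamination-free evaluation split.
CUTOFF = datetime(2024, 7, 1)

def released(problem):
    return datetime.strptime(problem["release_date"], "%Y-%m-%d")

train_split = [p for p in problems if released(p) < CUTOFF]
eval_split = [p for p in problems if released(p) >= CUTOFF]
```

Because evaluation problems post-date a model's training-data cutoff, strong scores on the eval split cannot be explained by memorization, which is what makes the benchmark contamination-free.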