🤖 AI Summary
Existing code-generation research lacks high-quality, reasoning-oriented evaluation benchmarks and integrated training–evaluation platforms. Method: We introduce the first timestamped (July 2024) LeetCode Python dataset, featuring rich metadata and 100+ test cases per problem, enabling contamination-free evaluation and efficient supervised fine-tuning (SFT). We propose a timestamp-based contamination-isolation protocol; SFT on only 2.6K model-generated solutions matches the performance of training on 110K human-written solutions. We also open-source end-to-end tooling for data cleaning, time-aware dataset splitting, SFT, and reasoning-focused evaluation. Contribution/Results: Our reasoning-enhanced model achieves a +12.3% absolute improvement on HumanEval and MBPP. The dataset and framework are publicly released on Hugging Face and GitHub.
📝 Abstract
We introduce LeetCodeDataset, a high-quality benchmark for evaluating and training code-generation models, addressing two key challenges in LLM research: the lack of reasoning-focused coding benchmarks and self-contained training testbeds. By curating LeetCode Python problems with rich metadata, broad coverage, 100+ test cases per problem, and temporal splits (pre/post July 2024), our dataset enables contamination-free evaluation and efficient supervised fine-tuning (SFT). Experiments show reasoning models significantly outperform non-reasoning counterparts, while SFT with only 2.6K model-generated solutions achieves performance comparable to 110K-sample counterparts. The dataset and evaluation framework are available on Hugging Face and GitHub.
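The temporal split at the heart of the contamination-isolation protocol can be sketched as follows. This is a minimal illustration only: the field names (`slug`, `release_date`) and date format are assumptions, not the dataset's actual schema, and the real pipeline presumably reads problem timestamps from the released metadata.

```python
from datetime import datetime

# Hypothetical problem records; field names are illustrative,
# not the actual LeetCodeDataset schema.
problems = [
    {"slug": "two-sum", "release_date": "2015-08-07"},
    {"slug": "fresh-problem", "release_date": "2024-09-01"},
]

# Cutoff from the paper: problems released before July 2024 go to
# training, later ones form the contamination-free evaluation split.
CUTOFF = datetime(2024, 7, 1)

def released(problem):
    return datetime.strptime(problem["release_date"], "%Y-%m-%d")

train_split = [p for p in problems if released(p) < CUTOFF]
eval_split = [p for p in problems if released(p) >= CUTOFF]
```

Because evaluation problems post-date a model's training-data cutoff, strong scores on the eval split cannot be explained by memorization, which is what makes the benchmark contamination-free.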