SnipGen: A Mining Repository Framework for Evaluating LLMs for Code

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To mitigate evaluation distortion in LLM-based code generation caused by train-test data contamination, this paper proposes SnipGen, a repository mining framework for method-level, fine-grained assessment. Methodologically, it (1) constructs recent, low-overlap testbeds by parsing method-level code changes from GitHub commits via Abstract Syntax Trees (ASTs); (2) provides composable prompt templates that can be chained into Chain-of-Thought-like sequences for controllable code-generation prompting; and (3) records data provenance during testbed construction. Contributions include an open-source, reproducible mining toolchain; a dataset of roughly 227K data points mined from 338K recent commits; improved evaluation credibility and interpretability; and support for fine-grained performance analysis of LLMs on code tasks.

📝 Abstract
Language Models (LLMs), such as transformer-based neural networks trained on billions of parameters, have become increasingly prevalent in software engineering (SE). These models, trained on extensive datasets that include code repositories, exhibit remarkable capabilities for SE tasks. However, evaluating their effectiveness poses significant challenges, primarily due to the potential overlap between the datasets used for training and those employed for evaluation. To address this issue, we introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation. SnipGen aims to mitigate data contamination by generating robust testbeds and crafting tailored data points to assist researchers and practitioners in evaluating LLMs for code-related tasks. In our exploratory study, SnipGen mined approximately 227K data points from 338K recent code changes in GitHub commits, focusing on method-level granularity. SnipGen features a collection of prompt templates that can be combined to create a Chain-of-Thought-like sequence of prompts, enabling a nuanced assessment of LLMs' code generation quality. By providing the mining tool, the methodology, and the dataset, SnipGen empowers researchers and practitioners to rigorously evaluate and interpret LLMs' performance in software engineering contexts.
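The abstract describes mining data points at method-level granularity from recent GitHub code changes. SnipGen's actual pipeline is not reproduced here; the sketch below illustrates the core idea of method-level extraction via an AST, using Python's standard `ast` module and assuming the mined files are Python sources:

```python
import ast

def extract_methods(source: str) -> dict[str, str]:
    """Map each function/method name in a source file to its code snippet."""
    tree = ast.parse(source)
    methods = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment recovers the exact source text of the node
            methods[node.name] = ast.get_source_segment(source, node)
    return methods

sample = '''
def add(a, b):
    return a + b
'''
methods = extract_methods(sample)
print(methods["add"])
```

A real miner would apply this to the before/after versions of files touched by each commit and diff the extracted methods to isolate method-level changes.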
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for code tasks
Mitigating data contamination in evaluations
Generating robust testbeds for code generation
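The contamination concern above is typically addressed by keeping only code committed after a model's training-data cutoff, so evaluation samples cannot have been seen during training. A minimal sketch, with a hypothetical cutoff date (the actual cutoffs depend on the model under evaluation):

```python
from datetime import datetime, timezone

# Hypothetical cutoff: the assumed end of the model's training data.
TRAINING_CUTOFF = datetime(2023, 9, 1, tzinfo=timezone.utc)

def is_contamination_safe(commit_date: datetime) -> bool:
    """Keep only commits made strictly after the training cutoff."""
    return commit_date > TRAINING_CUTOFF

commits = [
    datetime(2023, 3, 15, tzinfo=timezone.utc),
    datetime(2024, 5, 1, tzinfo=timezone.utc),
]
safe = [c for c in commits if is_contamination_safe(c)]
```

This date filter is a necessary but not sufficient safeguard: recent commits can still duplicate older code, so frameworks like SnipGen pair it with provenance tracking.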
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repository mining framework
Prompt engineering application
Data contamination mitigation
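The prompt-engineering contribution combines templates into a Chain-of-Thought-like sequence. The templates below are hypothetical placeholders, not SnipGen's actual prompts; the sketch only shows the chaining mechanism:

```python
# Hypothetical templates; SnipGen's real template collection is not shown here.
TEMPLATES = [
    "Summarize what the following method should do: {signature}",
    "List edge cases the implementation must handle.",
    "Now write the full method body.",
]

def chain_prompts(signature: str) -> list[str]:
    """Instantiate templates into an ordered, CoT-like prompt sequence."""
    return [
        t.format(signature=signature) if "{signature}" in t else t
        for t in TEMPLATES
    ]

prompts = chain_prompts("def parse_config(path: str) -> dict:")
```

Each prompt in the sequence would be sent to the LLM in turn, with earlier responses carried forward as context, enabling assessment of intermediate reasoning rather than only the final generated code.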