Designing Empirical Studies on LLM-Based Code Generation: Towards a Reference Framework

📅 2025-10-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current empirical studies on LLM-based code generation lack standardization: task definitions, evaluation objectives, and metrics vary widely, which severely undermines comparability and reproducibility. To address this, we propose the first systematic empirical research framework designed specifically for LLM code generation. Grounded in three core dimensions (problem source, quality attributes, and evaluation metrics), the framework synthesizes common elements from diverse empirical studies and, through comparative analysis and theoretical modeling, constructs a modular, extensible assessment system. Validation via mappings of representative case studies shows that the framework improves the rigor of experimental design and the standardization of reporting. Our primary contribution is a domain-specific theoretical framework for empirical research on LLM code generation, providing foundational support for standardizing LLM evaluation in software engineering.
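The three core dimensions read naturally as a small data model. Below is a minimal, hypothetical Python sketch of how a study design could be recorded along those dimensions; all class names, fields, and the HumanEval/pass@k example values are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field

# Hypothetical encoding of the framework's three core dimensions:
# problem source, quality attributes, and evaluation metrics.
# Names and fields are assumptions for illustration only.

@dataclass
class ProblemSource:
    name: str    # e.g. "HumanEval", "industrial issue tracker"
    origin: str  # e.g. "benchmark", "industrial", "synthetic"

@dataclass
class QualityAttribute:
    name: str    # e.g. "functional correctness", "readability"

@dataclass
class Metric:
    name: str                    # e.g. "pass@k", "CodeBLEU"
    attribute: QualityAttribute  # the attribute this metric operationalizes

@dataclass
class StudyDesign:
    """One empirical study described along the framework's dimensions."""
    problem_sources: list[ProblemSource] = field(default_factory=list)
    metrics: list[Metric] = field(default_factory=list)

    def quality_attributes(self) -> set[str]:
        # Attributes are derived from the chosen metrics, so an
        # evaluation goal with no metric behind it is easy to spot.
        return {m.attribute.name for m in self.metrics}

# Example: mapping a hypothetical benchmark-style study onto the model.
correctness = QualityAttribute("functional correctness")
study = StudyDesign(
    problem_sources=[ProblemSource("HumanEval", "benchmark")],
    metrics=[Metric("pass@k", correctness)],
)
print(study.quality_attributes())  # {'functional correctness'}
```

Encoding a study this way is one plausible reading of how the framework's "representative case mappings" could support comparison: two studies become comparable exactly where their recorded sources, attributes, and metrics overlap.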

📝 Abstract
The rise of large language models (LLMs) has introduced transformative potential in automated code generation, addressing a wide range of software engineering challenges. However, empirical evaluation of LLM-based code generation lacks standardization, with studies varying widely in goals, tasks, and metrics, which limits comparability and reproducibility. In this paper, we propose a theoretical framework for designing and reporting empirical studies on LLM-based code generation. The framework is grounded in both our prior experience conducting such experiments and a comparative analysis of key similarities and differences among recent studies. It organizes evaluation around core components such as problem sources, quality attributes, and metrics, supporting structured and systematic experimentation. We demonstrate its applicability through representative case mappings and identify opportunities for refinement. Looking forward, we plan to evolve the framework into a more robust and mature tool for standardizing LLM evaluation across software engineering contexts.
Problem

Research questions and friction points this paper is trying to address.

Standardizing empirical evaluation of LLM-based code generation
Addressing limited comparability and reproducibility across existing studies
Proposing a framework for systematic design of experiments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes a theoretical framework for empirical studies of LLM-based code generation
Organizes evaluation systematically around core components (problem sources, quality attributes, metrics)
Supports structured experimentation to standardize study design and reporting