Designing Empirical Studies on LLM-Based Code Generation: Towards a Reference Framework

📅 2025-10-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current empirical studies on LLM-based code generation lack standardization: task definitions, evaluation objectives, and metrics vary widely, which severely undermines comparability and reproducibility. To address this, we propose the first systematic empirical research framework designed specifically for LLM code generation. Grounded in three core dimensions (problem source, quality attributes, and evaluation metrics), the framework synthesizes common elements from diverse empirical studies and, through comparative analysis and theoretical modeling, constructs a modular, extensible assessment system. Validation via mappings of representative case studies shows that the framework improves the rigor of experimental design and the standardization of reporting. Our primary contribution is a domain-specific theoretical framework for empirical research on LLM code generation, providing foundational support for standardizing LLM evaluation in software engineering.
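The three core dimensions read naturally as a small data model. Below is a minimal, hypothetical Python sketch of how a study design could be recorded along those dimensions; all class names, fields, and the HumanEval/pass@k example values are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field

# Hypothetical encoding of the framework's three core dimensions:
# problem source, quality attributes, and evaluation metrics.
# Names and fields are assumptions for illustration only.

@dataclass
class ProblemSource:
    name: str    # e.g. "HumanEval", "industrial issue tracker"
    origin: str  # e.g. "benchmark", "industrial", "synthetic"

@dataclass
class QualityAttribute:
    name: str    # e.g. "functional correctness", "readability"

@dataclass
class Metric:
    name: str                    # e.g. "pass@k", "CodeBLEU"
    attribute: QualityAttribute  # the attribute this metric operationalizes

@dataclass
class StudyDesign:
    """One empirical study described along the framework's dimensions."""
    problem_sources: list[ProblemSource] = field(default_factory=list)
    metrics: list[Metric] = field(default_factory=list)

    def quality_attributes(self) -> set[str]:
        # Attributes are derived from the chosen metrics, so an
        # evaluation goal with no metric behind it is easy to spot.
        return {m.attribute.name for m in self.metrics}

# Example: mapping a hypothetical benchmark-style study onto the model.
correctness = QualityAttribute("functional correctness")
study = StudyDesign(
    problem_sources=[ProblemSource("HumanEval", "benchmark")],
    metrics=[Metric("pass@k", correctness)],
)
print(study.quality_attributes())  # {'functional correctness'}
```

Encoding a study this way is one plausible reading of how the framework's "representative case mappings" could support comparison: two studies become comparable exactly where their recorded sources, attributes, and metrics overlap.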

📝 Abstract
The rise of large language models (LLMs) has introduced transformative potential in automated code generation, addressing a wide range of software engineering challenges. However, empirical evaluation of LLM-based code generation lacks standardization, with studies varying widely in goals, tasks, and metrics, which limits comparability and reproducibility. In this paper, we propose a theoretical framework for designing and reporting empirical studies on LLM-based code generation. The framework is grounded in both our prior experience conducting such experiments and a comparative analysis of key similarities and differences among recent studies. It organizes evaluation around core components such as problem sources, quality attributes, and metrics, supporting structured and systematic experimentation. We demonstrate its applicability through representative case mappings and identify opportunities for refinement. Looking forward, we plan to evolve the framework into a more robust and mature tool for standardizing LLM evaluation across software engineering contexts.
Problem

Research questions and friction points this paper is trying to address.

Standardizing empirical evaluation of LLM-based code generation
Addressing limited comparability and reproducibility across existing studies
Proposing a framework for systematic design of experiments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes a theoretical framework for empirical studies of LLM-based code generation
Organizes evaluation systematically around core components (problem sources, quality attributes, metrics)
Supports structured experimentation to standardize study design and reporting