🤖 AI Summary
This work proposes CausalSE, a framework that combines structural causal models (SCMs) with propensity score matching to identify the causal effects of interventions, such as prompt engineering, on the code generation performance of large language models. It addresses a limitation of traditional empirical software engineering studies, which often rely on statistical associations that are vulnerable to confounding bias, by bringing Pearl’s causal inference paradigm to the field. In a case study on the Galeras dataset, conventional association-based analyses suggest that more complex prompts improve performance, whereas causal analysis under CausalSE often finds no significant treatment effect, exposing false-positive conclusions that arise from unaccounted confounders. The paper also provides a tutorial-based, reproducible methodology for causal inference in software engineering contexts.
📝 Abstract
Causal inference offers a fundamental approach for advancing empirical software engineering (ESE) beyond traditional statistical association, enabling researchers to rigorously identify and quantify causal relationships in software experiments. This paper introduces CausalSE, a framework that operationalizes Judea Pearl's causal inference paradigm in the ESE context. The paper focuses on Structural Causal Models (SCMs) to address the limitations of classical statistical methods in mitigating confounding bias. Through a case study using the Galeras dataset and propensity score matching, we demonstrate how CausalSE disentangles the effect of prompt engineering strategies on code generation outcomes in a popular LLM (GPT-3). The results reveal that while associational analyses can suggest improvements from certain interventions (e.g., more complex prompts), causal analysis often does not find a significant treatment effect, highlighting the risk of false positives when confounding is not addressed. By providing a tutorial-based methodology and a real-world case study, this work equips software researchers with practical tools to design, analyze, and interpret software experiments with methodological rigor, ultimately enabling more informed and actionable conclusions in both research and practice.
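To make the contrast between associational and causal estimates concrete, the following is a minimal sketch of propensity score matching on synthetic data. It is not the CausalSE implementation and does not use the Galeras dataset. All names (`simulate`, `fit_propensity`, `matched_att`), the single confounder `z`, and the data-generating coefficients are assumptions chosen for illustration. The true treatment effect is set to zero by construction, so any naive difference comes from confounding alone.

```python
import bisect
import math
import random
import statistics


def simulate(n=2000, seed=0):
    """Synthetic rows (z, t, y). z is a hypothetical confounder that raises
    both the chance of treatment and the outcome; t has NO effect on y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0, 1)
        p_treat = 1 / (1 + math.exp(-1.5 * z))
        t = 1 if rng.random() < p_treat else 0
        y = 2.0 * z + rng.gauss(0, 1)  # outcome depends on z only
        rows.append((z, t, y))
    return rows


def fit_propensity(rows, lr=0.5, epochs=200):
    """Estimate P(T=1 | z) with logistic regression fitted by
    batch gradient descent (plain Python, no external libraries)."""
    w, b = 0.0, 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for z, t, _ in rows:
            p = 1 / (1 + math.exp(-(w * z + b)))
            grad_w += (p - t) * z
            grad_b += p - t
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return lambda z: 1 / (1 + math.exp(-(w * z + b)))


def naive_difference(rows):
    """Associational estimate: mean outcome of treated minus control."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return statistics.mean(treated) - statistics.mean(control)


def matched_att(rows, propensity, caliper=0.02):
    """1:1 nearest-neighbour matching (with replacement) on the propensity
    score; treated units with no control within the caliper are dropped."""
    controls = sorted((propensity(z), y) for z, t, y in rows if t == 0)
    scores = [s for s, _ in controls]
    diffs = []
    for z, t, y in rows:
        if t != 1:
            continue
        s = propensity(z)
        i = bisect.bisect_left(scores, s)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(controls)]
        j = min(candidates, key=lambda k: abs(scores[k] - s))
        if abs(scores[j] - s) <= caliper:
            diffs.append(y - controls[j][1])
    return statistics.mean(diffs)


if __name__ == "__main__":
    data = simulate()
    ps = fit_propensity(data)
    print("naive difference:", round(naive_difference(data), 3))
    print("matched ATT estimate:", round(matched_att(data, ps), 3))
```

In this setup the naive difference should come out clearly positive while the matched estimate should sit close to zero, mirroring the kind of false positive the abstract describes. A real analysis would first justify the adjustment set with an SCM and check covariate balance after matching.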