Performance Evaluation of Large Language Models in Statistical Programming

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating large language models’ (LLMs) capability to generate correct, executable, and statistically sound SAS code remains an open challenge, particularly for complex multivariate statistical analyses. Method: We systematically assess ChatGPT (v3.5/v4) and Llama on SAS programming tasks, introducing the first standardized, statistics-oriented SAS benchmark—comprising real-world datasets, structured problem specifications, data documentation, and expert-validated reference implementations. We employ a multidimensional human evaluation framework assessing five dimensions: syntactic correctness, executability, statistical semantic accuracy, result reliability, and code readability. Contribution/Results: While all models consistently produce syntactically valid code, they exhibit high error rates and substantial redundancy in tasks requiring deep statistical reasoning. Critically, none meet practical thresholds for executability or result accuracy. This work presents the first comprehensive, human-led evaluation of LLM-generated statistical code and publicly releases the benchmark, establishing both methodological rigor and empirical grounding for trustworthy statistical AI.

📝 Abstract
The programming capabilities of large language models (LLMs) have revolutionized automatic code generation and opened new avenues for automatic statistical analysis. However, the validity and quality of the generated code need to be systematically evaluated before it can be widely adopted. Despite their growing prominence, comprehensive evaluations of statistical code generated by LLMs remain scarce in the literature. In this paper, we assess the performance of LLMs, including two versions of ChatGPT and one version of Llama, in the domain of SAS programming for statistical analysis. Our study utilizes a set of statistical analysis tasks encompassing diverse statistical topics and datasets. Each task includes a problem description, dataset information, and human-verified SAS code. We conduct a comprehensive assessment of the quality of SAS code generated by LLMs through human expert evaluation based on correctness, effectiveness, readability, executability, and the accuracy of output results. The analysis of rating scores reveals that while LLMs demonstrate usefulness in generating syntactically correct code, they struggle with tasks requiring deep domain understanding and may produce redundant or incorrect results. This study offers valuable insights into the capabilities and limitations of LLMs in statistical programming, providing guidance for future advancements in AI-assisted coding systems for statistical analysis.
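The abstract's multidimensional rubric (correctness, effectiveness, readability, executability, output accuracy) can be pictured as per-task expert scores averaged per dimension for each model. A minimal sketch of that aggregation, assuming a 1-5 rating scale; the scale and all numbers below are illustrative assumptions, not the paper's actual rubric or results:

```python
from dataclasses import dataclass, fields
from statistics import mean

# The five evaluation dimensions named in the abstract; the 1-5 scale
# and the example scores further down are hypothetical.
@dataclass
class TaskRating:
    correctness: float
    effectiveness: float
    readability: float
    executability: float
    output_accuracy: float

def aggregate(ratings):
    """Average each dimension over all rated tasks for one model."""
    return {
        f.name: mean(getattr(r, f.name) for r in ratings)
        for f in fields(TaskRating)
    }

# Illustrative expert ratings for two SAS tasks (made-up numbers).
model_ratings = [
    TaskRating(5, 4, 4, 3, 3),
    TaskRating(4, 3, 5, 2, 2),
]
print(aggregate(model_ratings))
```

Averaging per dimension rather than per task keeps the profile interpretable: a model can score high on syntactic correctness while scoring low on executability or output accuracy, which is exactly the pattern the paper reports.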
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs' statistical code quality
Assess LLMs in SAS programming tasks
Identify LLMs' limitations in domain understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates SAS code quality
Uses human expert assessment
Tests diverse statistical tasks
Xinyi Song
Department of Statistics, Virginia Tech, Blacksburg, VA 24061

Kexin Xie
Department of Statistics, Virginia Tech, Blacksburg, VA 24061

Lina Lee
Department of Statistics, Virginia Tech, Blacksburg, VA 24061

Ruizhe Chen
Zhejiang University
LLM, MLLM

Jared M. Clark
Department of Statistics, Virginia Tech, Blacksburg, VA 24061

Hao He
Department of Statistics, Virginia Tech, Blacksburg, VA 24061

Haoran He
Hong Kong University of Science and Technology
machine learning, reinforcement learning

Jie Min
Department of Mathematics & Statistics, University of South Florida, Tampa, FL 33620

Xinlei Zhang
Department of Statistics, Virginia Tech, Blacksburg, VA 24061

Simin Zheng
Department of Statistics, Virginia Tech, Blacksburg, VA 24061

Zhiyang Zhang
Nanjing University
NLP, LLM, Agent, AIOps

Xinwei Deng
Professor of Statistics, Virginia Tech
machine learning, design of experiments, uncertainty quantification

Yili Hong
Professor of Statistics, Virginia Tech
Engineering Statistics, Reliability, Machine Learning, Statistical Computing, Biostatistics