Measuring LLM Code Generation Stability via Structural Entropy

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the lack of stability assessment for large language models (LLMs) in code generation. We propose a lightweight, reference-free, and execution-free structural stability evaluation method. Our approach integrates structural entropy with abstract syntax tree (AST) analysis: it extracts depth-limited subtrees and models their distributional properties to define two complementary metrics, Jensen-Shannon divergence and a structural cross-entropy ratio, enabling language-agnostic, fine-grained stability quantification. Crucially, the method discriminates between mutation patterns at the control-flow and identifier levels. With O(n·d) time complexity, it is computationally efficient and has been validated across mainstream LLMs. Experiments reveal significant inter-model differences in consistency and robustness, demonstrating the method's effectiveness in exposing stability bottlenecks. The framework is modular and readily integrable into existing evaluation pipelines.

📝 Abstract
Assessing the stability of code generation from large language models (LLMs) is essential for judging their reliability in real-world development. We extend prior "structural-entropy concepts" to the program domain by pairing entropy with abstract syntax tree (AST) analysis. For any fixed prompt, we collect the multiset of depth-bounded subtrees of the AST of each generated program and treat their relative frequencies as a probability distribution. We then measure stability in two complementary ways: (i) Jensen-Shannon divergence, a symmetric, bounded indicator of structural overlap, and (ii) a Structural Cross-Entropy ratio that highlights missing high-probability patterns. Both metrics admit structural-only and token-aware variants, enabling separate views of control-flow shape and identifier-level variability. Unlike pass@k, BLEU, or CodeBLEU, our metrics are reference-free, language-agnostic, and execution-independent. We benchmark several leading LLMs on standard code generation tasks, demonstrating that AST-driven structural entropy reveals nuances in model consistency and robustness. The method runs in O(n·d) time with no external tests, providing a lightweight addition to the code-generation evaluation toolkit.
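The paper does not include an implementation, but the first step it describes (collecting depth-bounded AST subtrees and treating their relative frequencies as a probability distribution) can be sketched in Python for the structural-only variant, where node types are kept and identifiers are ignored. The function names and the default depth of 2 are illustrative assumptions, not the paper's.

```python
import ast
from collections import Counter

def subtree_shape(node: ast.AST, depth: int) -> str:
    """Serialize a node's structure (node types only) down to a depth limit.
    Identifiers and literal values are deliberately dropped, giving the
    structural-only view described in the abstract."""
    if depth == 0:
        return node.__class__.__name__
    children = [subtree_shape(c, depth - 1) for c in ast.iter_child_nodes(node)]
    return f"{node.__class__.__name__}({','.join(children)})"

def subtree_distribution(code: str, depth: int = 2) -> dict:
    """Relative frequencies of depth-bounded subtree shapes in one program."""
    tree = ast.parse(code)
    counts = Counter(subtree_shape(n, depth) for n in ast.walk(tree))
    total = sum(counts.values())
    return {shape: c / total for shape, c in counts.items()}
```

Because only node types are serialized, two programs that differ solely in identifier names yield identical distributions; a token-aware variant would additionally fold names into the shape strings.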
Problem

Research questions and friction points this paper is trying to address.

Assess LLM code generation stability via structural entropy
Develop reference-free metrics using AST analysis and entropy
Evaluate model consistency without external tests or references
Innovation

Methods, ideas, or system contributions that make the work stand out.

AST-based structural entropy for stability
Jensen-Shannon divergence for structural overlap
Structural Cross-Entropy ratio for missing patterns
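Given two subtree-frequency distributions (one per generated program), the two metrics above can be sketched as follows. The Jensen-Shannon divergence is standard; the exact form of the Structural Cross-Entropy ratio is not spelled out on this page, so the normalization by H(p) and the epsilon smoothing for unseen patterns below are assumptions for illustration.

```python
import math

def js_divergence(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence in base 2: symmetric, bounded in [0, 1]."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sce_ratio(p: dict, q: dict, eps: float = 1e-12) -> float:
    """One plausible Structural Cross-Entropy ratio: H(p, q) / H(p).
    It grows when q assigns little mass to patterns that are
    high-probability under p, i.e. when q "misses" common structures.
    The eps smoothing for patterns absent from q is an assumption."""
    h_pq = -sum(pv * math.log2(q.get(k, eps)) for k, pv in p.items() if pv > 0)
    h_p = -sum(pv * math.log2(pv) for pv in p.values() if pv > 0)
    return h_pq / h_p if h_p > 0 else float("inf")
```

Identical distributions give a divergence of 0 and a ratio of 1; both values rise as the generated programs drift apart structurally.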
Yewei Song
Ph.D. Candidate, University of Luxembourg
natural language processing · software engineering
Tiezhu Sun
PhD, University of Luxembourg
AI4SE · AI4Cybersecurity · Malware Learning · LLMs
Xunzhu Tang
The Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg
Prateek Rajput
The Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg
Tegawendé F. Bissyandé
The Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg
Jacques Klein
University of Luxembourg / SnT
Computer Science · Software Engineering · Android Security · Software Security · Model-Driven Engineering