🤖 AI Summary
This work identifies a pervasive memorization phenomenon in large language models (LLMs) for code generation: models frequently reproduce prompt-answer pairs from training data rather than internalizing programming principles, severely undermining generalization. To quantify this, we propose the first AST-based memorization score, which measures syntactic similarity between generated and training code. We systematically assess memorization via code mutation, prompt rewriting, and semantics-preserving problem rephrasing to generate diverse input variants. Experiments reveal a non-monotonic memorization trend during supervised fine-tuning (initially increasing, then decreasing) that aligns with overfitting dynamics. Strong memorization is consistently observed across multiple state-of-the-art code LLMs. Moreover, common mitigation strategies, including prompt translation and data augmentation, prove ineffective and often degrade performance on the original task. Our framework provides both a theoretical foundation and an empirical benchmark for analyzing and mitigating code memorization in LLMs.
📄 Abstract
Large Language Models (LLMs) are known to exhibit a memorization phenomenon in code generation: instead of truly understanding the underlying principles of a programming problem, they tend to memorize the original prompt and its solution together during training. Consequently, when facing variants of the original problem, their answers are likely to resemble the memorized solutions and fail to generalize. In this paper, we investigate this phenomenon by designing three evolution strategies to create variants: mutation, paraphrasing, and code-rewriting. By comparing the performance and AST similarity of LLM-generated code before and after these three evolutions, we develop a memorization score that positively correlates with the level of memorization. As expected, as supervised fine-tuning proceeds, the memorization score rises before overfitting sets in, suggesting increasingly severe memorization. We demonstrate that common mitigation approaches, such as prompt translation and using evolved variants for data augmentation in supervised learning and reinforcement learning, either compromise performance or fail to alleviate the memorization issue. Memorization therefore remains a significant challenge in LLM code generation, highlighting the need for more effective solutions.
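To make the idea of an AST-based syntactic similarity score concrete, here is a minimal sketch (not the paper's actual metric, which is not reproduced here): it flattens each program's abstract syntax tree into a sequence of node-type names using Python's `ast` module and compares the sequences with `difflib.SequenceMatcher`. Because identifiers and literals are dropped, two snippets that differ only in variable names score as structurally identical — the kind of surface-level resemblance a memorization score aims to capture.

```python
import ast
import difflib


def ast_node_types(code: str) -> list[str]:
    """Flatten a program's AST into a sequence of node-type names.

    Identifiers and literal values are discarded, so only the
    syntactic shape of the program remains.
    """
    tree = ast.parse(code)
    return [type(node).__name__ for node in ast.walk(tree)]


def ast_similarity(code_a: str, code_b: str) -> float:
    """Syntactic similarity in [0, 1] between two code snippets,
    computed as the longest-matching-subsequence ratio over
    their node-type sequences."""
    return difflib.SequenceMatcher(
        None, ast_node_types(code_a), ast_node_types(code_b)
    ).ratio()


# Renamed variables leave the AST shape unchanged, so the score is 1.0.
generated = "def add(a, b):\n    return a + b\n"
training = "def add(x, y):\n    return x + y\n"
print(ast_similarity(generated, training))
```

A comparison along these lines is what lets the evolution strategies above be evaluated: if a model's answer to a mutated or paraphrased prompt still has near-identical AST structure to the original training solution, that points to memorization rather than understanding.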