Examining LLMs' Ability to Summarize Code Through Mutation Analysis

📅 2026-02-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work proposes a mutation-based evaluation framework to assess the behavioral fidelity of code summaries generated by large language models (LLMs). By systematically injecting statement-, value-, and decision-level mutations into source code, the method evaluates whether LLM-generated summaries accurately reflect subtle changes in program behavior. This approach represents the first systematic application of mutation testing to measure behavioral consistency in code summarization, revealing a fundamental limitation: current models tend to describe high-level intent or generic patterns rather than precise implementation details. Empirical evaluation across 62 programs (624 assessments) shows that single-function summary accuracy reaches 76.5%, but drops sharply to 17.3% in multithreaded contexts. Although GPT-5.2 demonstrates significant improvement over GPT-4 (49.3% → 85.3%), it still struggles to capture fine-grained behavioral nuances.
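The three mutation levels (statement, value, decision) can be illustrated on a small function. These specific mutants are illustrative examples, not taken from the paper's benchmark:

```python
# Illustrative mutants for the three mutation levels the paper injects.
# Each mutant changes observable behavior for some inputs, which a
# behavior-faithful summary should reflect.

def clamp(x, lo, hi):
    """Original: restrict x to the closed interval [lo, hi]."""
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def clamp_value_mutant(x, lo, hi):
    """Value-level mutation: constant perturbed in the lower bound."""
    if x < lo + 1:          # mutated: lo -> lo + 1
        return lo
    if x > hi:
        return hi
    return x

def clamp_decision_mutant(x, lo, hi):
    """Decision-level mutation: comparison operator flipped."""
    if x < lo:
        return lo
    if x < hi:              # mutated: > -> <
        return hi
    return x

def clamp_statement_mutant(x, lo, hi):
    """Statement-level mutation: an early return deleted."""
    if x < lo:
        pass                # mutated: 'return lo' removed
    if x > hi:
        return hi
    return x
```

A summary such as "clamps x to [lo, hi]" remains textually plausible for all three mutants, which is exactly the failure mode the paper probes: the description matches the intent, not the mutated behavior.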

📝 Abstract
As developers increasingly rely on LLM-generated code summaries for documentation, testing, and review, it is important to study whether these summaries accurately reflect what the program actually does. LLMs often produce confident descriptions of what the code looks like it should do (intent) while missing subtle edge cases or logic changes that define what it actually does (behavior). We present a mutation-based evaluation methodology that directly tests whether a summary truly matches the code's logic. Our approach generates a summary, injects a targeted mutation into the code, and checks whether the LLM updates its summary to reflect the new behavior. We validate the methodology through three experiments totalling 624 mutation-summary evaluations across 62 programs. First, we evaluate 12 controlled synthetic programs with 324 mutations varying in type (statement, value, decision) and location (beginning, middle, end); summary accuracy decreases sharply with complexity, from 76.5% for single functions to 17.3% for multi-threaded systems, while mutation type and location exhibit weaker effects. Second, testing 150 mutated samples from 50 human-written programs in the Less Basic Python Problems (LBPP) dataset confirms that the same failure patterns persist: models often describe algorithmic intent rather than actual mutated behavior, with a summary accuracy of 49.3%. Third, a comparison between GPT-4 and GPT-5.2 shows a substantial performance leap (from 49.3% to 85.3%) and an improved ability to identify mutations as "bugs", yet both models continue to struggle to distinguish implementation details from standard algorithmic patterns. This work establishes mutation analysis as a systematic approach for assessing whether LLM-generated summaries reflect program behavior rather than superficial textual patterns.
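The generate-mutate-recheck loop described in the abstract can be sketched as follows. The `summarize` stub stands in for an LLM call, and the change-detection check is a simple string comparison rather than the paper's actual judgment procedure; both are assumptions made to keep the sketch runnable:

```python
# Minimal sketch of the mutation-based evaluation loop: summarize the
# original code, summarize the mutant, and flag whether the summary
# changed when the behavior did. In the paper, `summarize` would be an
# LLM and the comparison a careful accuracy judgment; here both are
# trivial stand-ins (assumptions, not the paper's implementation).

def summarize(code: str) -> str:
    # Placeholder for an LLM summarization call.
    return "returns the maximum" if "max" in code else "returns a value"

def summary_reflects_mutation(original_code: str, mutated_code: str) -> bool:
    """True if the summary differs once the code's behavior changes."""
    before = summarize(original_code)
    after = summarize(mutated_code)
    # A behavior-faithful summarizer should produce a different summary
    # for a behavior-changing mutant; an intent-level summarizer often
    # repeats the same description, which this check would catch.
    return before != after

original = "def pick(a, b):\n    return max(a, b)\n"
mutated = "def pick(a, b):\n    return min(a, b)\n"  # decision-level mutant

print(summary_reflects_mutation(original, mutated))  # True with this stub
```

Aggregating this check over many programs and mutants yields the summary-accuracy rates the paper reports (e.g. 76.5% for single functions, 17.3% for multi-threaded systems).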
Problem

Research questions and friction points this paper is trying to address.

code summarization
large language models
program behavior
mutation analysis
code documentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

mutation analysis
code summarization
large language models
program behavior
evaluation methodology