Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the reasoning robustness of large language models (LLMs) on code understanding tasks under semantics-preserving mutations, aiming to distinguish correct predictions rooted in genuine semantic comprehension from those arising from superficial pattern matching. Method: The authors apply five categories of semantics-preserving code mutations—including variable renaming, if-else branch swapping, and loop transformations—as diagnostic probes for reasoning robustness. Using the LiveCodeBench and CruxEval benchmarks, they evaluate cross-mutation prediction stability across six LLMs with up to 8B parameters and perform a human expert analysis of reasoning quality. Results: Up to 61% of correct predictions by some models (e.g., Llama3.2) stem from flawed reasoning. Moreover, most models exhibit substantial prediction instability under semantics-preserving perturbations, revealing systemic fragility in deep code semantic understanding. These findings highlight critical limitations in current LLMs' code reasoning capabilities and underscore the need for more semantically grounded evaluation methodologies.

📝 Abstract
Understanding the reasoning and robustness of Large Language Models (LLMs) is critical for their reliable use in programming tasks. While recent studies have assessed LLMs' ability to predict program outputs, most focus solely on the accuracy of those predictions, without evaluating the reasoning behind them. Moreover, it has been observed on mathematical reasoning tasks that LLMs can arrive at correct answers through flawed logic, raising concerns about similar issues in code understanding. In this work, we evaluate whether state-of-the-art LLMs with up to 8B parameters can reason about Python programs or are simply guessing. We apply five semantics-preserving code mutations: renaming variables, mirroring comparison expressions, swapping if-else branches, converting for loops to while, and loop unrolling. These mutations maintain program semantics while altering its syntax. We evaluated six LLMs and performed a human expert analysis using LiveCodeBench to assess whether the correct predictions are based on sound reasoning. We also evaluated prediction stability across different code mutations on LiveCodeBench and CruxEval. Our findings show that some LLMs, such as Llama3.2, produce correct predictions based on flawed reasoning in up to 61% of cases. Furthermore, LLMs often change predictions in response to our code mutations, indicating limited robustness in their semantic understanding.
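The five mutation classes can be illustrated on a toy Python function (a hypothetical example for illustration, not taken from the paper): both versions below compute the same result, but the mutated version applies variable renaming, a mirrored comparison, swapped if-else branches, and a for-to-while conversion.

```python
def count_evens(nums):
    """Original: count the even numbers in a list."""
    total = 0
    for n in nums:
        if n % 2 == 0:
            total += 1
    return total

def count_evens_mutated(values):
    """Semantically identical after four of the paper's mutations:
    renamed variables, mirrored comparison, swapped branches,
    and a for loop rewritten as a while loop."""
    acc = 0
    i = 0
    while i < len(values):
        v = values[i]
        if 0 != v % 2:      # mirrored comparison, branches swapped
            pass            # odd: do nothing
        else:
            acc += 1        # even: count it
        i += 1
    return acc

# The two functions agree on every input:
print(count_evens([1, 2, 3, 4]))          # 2
print(count_evens_mutated([1, 2, 3, 4]))  # 2
```

A robust model should predict the same output for both versions; the paper measures how often predictions flip under such syntactic changes.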
Problem

Research questions and friction points this paper is trying to address.

Assess LLMs' reasoning robustness in code understanding tasks
Evaluate whether LLMs genuinely reason about Python programs or merely guess
Test LLMs' prediction stability under semantics-preserving code mutations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates LLMs with semantics-preserving code mutations
Assesses reasoning via human expert and benchmark analysis
Reveals flawed logic in correct predictions by LLMs