When Names Disappear: Revealing What LLMs Actually Understand About Code

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a critical flaw in large language models' (LLMs) code understanding: their heavy reliance on identifier naming (the *intent channel*) rather than structural semantics (the *behavior channel*). Experiments show that removing identifier names severely degrades intent-oriented tasks such as code summarization, reducing outputs to line-by-line descriptions; surprisingly, performance also drops on execution-oriented tasks that should depend only on structure, suggesting existing benchmarks reward memorization of naming patterns rather than genuine semantic reasoning. To address this, the authors propose a suite of semantics-preserving code obfuscations and introduce **ClassEval-Obf**, an obfuscation-enhanced benchmark that systematically suppresses identifier leakage and memorization shortcuts while preserving program behavior. ClassEval-Obf deflates the inflated scores of state-of-the-art models on both summarization and execution tasks, establishing a more reliable, bias-resistant basis for assessing genuine code comprehension in LLMs.
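A semantics-preserving obfuscation of the kind described here can be sketched with Python's `ast` module: rename every user-defined identifier to an opaque placeholder while leaving control flow, operators, and literals untouched. The `IdentifierObfuscator` class and the `v0`/`v1` placeholder scheme below are illustrative assumptions, not the authors' implementation:

```python
import ast
import builtins

BUILTIN_NAMES = set(dir(builtins))

class IdentifierObfuscator(ast.NodeTransformer):
    """Rename user-defined identifiers to opaque placeholders while
    leaving control flow, operators, and literals untouched."""

    def __init__(self):
        self.mapping = {}

    def _opaque(self, name):
        if name in BUILTIN_NAMES or name.startswith("__"):
            return name  # renaming builtins would change behavior
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._opaque(node.name)
        for arg in node.args.args:
            arg.arg = self._opaque(arg.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        node.id = self._opaque(node.id)
        return node

src = (
    "def average_price(prices):\n"
    "    total = sum(prices)\n"
    "    return total / len(prices)\n"
)
tree = IdentifierObfuscator().visit(ast.parse(src))
obfuscated = ast.unparse(tree)

# Behavior is unchanged: both variants compute the same result.
ns_orig, ns_obf = {}, {}
exec(src, ns_orig)
exec(obfuscated, ns_obf)
```

After the rewrite, `average_price` becomes `v0` and all intent-carrying names are gone, yet the function still computes the same values: exactly the setting in which the paper probes whether a model's "understanding" survives.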

📝 Abstract
Large Language Models (LLMs) achieve strong results on code tasks, but how they derive program meaning remains unclear. We argue that code communicates through two channels: structural semantics, which define formal behavior, and human-interpretable naming, which conveys intent. Removing the naming channel severely degrades intent-level tasks such as summarization, where models regress to line-by-line descriptions. Surprisingly, we also observe consistent reductions on execution tasks that should depend only on structure, revealing that current benchmarks reward memorization of naming patterns rather than genuine semantic reasoning. To disentangle these effects, we introduce a suite of semantics-preserving obfuscations and show that they expose identifier leakage across both summarization and execution. Building on these insights, we release ClassEval-Obf, an obfuscation-enhanced benchmark that systematically suppresses naming cues while preserving behavior. Our results demonstrate that ClassEval-Obf reduces inflated performance gaps, weakens memorization shortcuts, and provides a more reliable basis for assessing LLMs' code understanding and generalization.
Problem

Research questions and friction points this paper is trying to address.

LLMs rely on naming patterns rather than structural semantics for code understanding
Current benchmarks fail to assess genuine semantic reasoning in code tasks
Obfuscation techniques reveal identifier leakage in both execution and summarization tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces semantics-preserving obfuscations to expose identifier leakage
Releases ClassEval-Obf benchmark suppressing naming cues
Systematically assesses LLMs' code understanding and generalization
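Since the obfuscations must preserve behavior, a natural sanity check is differential execution: run the original and obfuscated variants on the same inputs and confirm they agree. A minimal sketch, where the helper `behaviors_match` and the toy programs are hypothetical, not part of the paper's tooling:

```python
def behaviors_match(src_a, src_b, entry_a, entry_b, inputs):
    """Execute two program variants and check that the named entry
    functions agree on every provided input tuple."""
    ns_a, ns_b = {}, {}
    exec(src_a, ns_a)
    exec(src_b, ns_b)
    return all(ns_a[entry_a](*args) == ns_b[entry_b](*args) for args in inputs)

original = "def double(x):\n    return x * 2\n"
obfuscated = "def v0(v1):\n    return v1 * 2\n"
ok = behaviors_match(original, obfuscated, "double", "v0", [(3,), (0,), (-5,)])
```

Any execution-oriented benchmark score that drops after such a check passes can only be attributed to the missing names, which is the leakage signal the paper measures.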
Cuong Chi Le
FPT Software AI Center, University of Texas at Dallas
AI4SE, Machine Learning, LLM, Automated Software Engineering
Minh V. T. Pham
FPT Software AI Center, Hanoi, Vietnam
Cuong Duc Van
FPT Software AI Center, Hanoi, Vietnam
Hoang N. Phan
Nanyang Technological University, Singapore
Huy N. Phan
FPT Software AI Center, Hanoi, Vietnam
Tien N. Nguyen
Professor, School of Engineering and Computer Science - The University of Texas at Dallas
AI4SE, Automated Software Engineering, Artificial Intelligence, Mining Software Repositories