🤖 AI Summary
This work identifies a critical flaw in how large language models (LLMs) understand code: their heavy reliance on identifier naming (the *intent channel*) rather than structural semantics (the *behavior channel*). Experiments show that removing identifier names causes severe degradation on intent-oriented tasks (e.g., code summarization), reducing outputs to line-by-line descriptions; surprisingly, performance also drops significantly on execution-oriented tasks, suggesting that existing benchmarks are contaminated by spurious naming-pattern memorization. To address this, the authors propose semantics-preserving code obfuscation and introduce **ClassEval-Obf**, a de-biased evaluation benchmark that systematically suppresses identifier leakage and memorization shortcuts. ClassEval-Obf substantially reduces the inflated scores of state-of-the-art models on both summarization and execution tasks, enabling a more reliable assessment of genuine code comprehension. The benchmark thus offers a reproducible, bias-resistant standard for evaluating code understanding in LLMs.
📝 Abstract
Large Language Models (LLMs) achieve strong results on code tasks, but how they derive program meaning remains unclear. We argue that code communicates through two channels: structural semantics, which define formal behavior, and human-interpretable naming, which conveys intent. Removing the naming channel severely degrades intent-level tasks such as summarization, where models regress to line-by-line descriptions. Surprisingly, we also observe consistent reductions on execution tasks that should depend only on structure, revealing that current benchmarks reward memorization of naming patterns rather than genuine semantic reasoning. To disentangle these effects, we introduce a suite of semantics-preserving obfuscations and show that they expose identifier leakage across both summarization and execution. Building on these insights, we release ClassEval-Obf, an obfuscation-enhanced benchmark that systematically suppresses naming cues while preserving behavior. Our results demonstrate that ClassEval-Obf reduces inflated performance gaps, weakens memorization shortcuts, and provides a more reliable basis for assessing LLMs' code understanding and generalization.
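To make the idea of semantics-preserving obfuscation concrete, here is a minimal, hypothetical sketch (not the authors' actual tooling) of how identifier names can be stripped from Python code while leaving its behavior untouched. It renames function names, arguments, and local variables to opaque placeholders via the standard-library `ast` module; built-ins like `sum` and `len` are left alone.

```python
import ast

class IdentifierObfuscator(ast.NodeTransformer):
    """Rename function names, arguments, and locally bound variables to
    opaque placeholders (f0, v1, ...) while preserving program behavior.
    Names never bound locally (e.g., builtins) are left unchanged."""

    def __init__(self):
        self.mapping = {}  # original name -> opaque name

    def _opaque(self, name, prefix):
        if name not in self.mapping:
            self.mapping[name] = f"{prefix}{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._opaque(node.name, "f")
        self.generic_visit(node)  # args are visited before the body
        return node

    def visit_arg(self, node):
        node.arg = self._opaque(node.arg, "v")
        return node

    def visit_Name(self, node):
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        elif isinstance(node.ctx, ast.Store):
            # First binding of a local variable: assign a fresh placeholder.
            node.id = self._opaque(node.id, "v")
        return node

src = """
def average_grade(scores):
    total = sum(scores)
    return total / len(scores)
"""

tree = IdentifierObfuscator().visit(ast.parse(src))
obfuscated = ast.unparse(tree)  # requires Python 3.9+
print(obfuscated)
```

The obfuscated function computes the same value as the original, but the intent channel ("average", "grade", "scores") is gone; a model that truly reasons over structure should summarize and execute it just as well, which is the property the paper's benchmark tests. (A full tool would also need to handle scoping, attributes, imports, and string literals.)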