🤖 AI Summary
This work identifies a critical flaw in how large language models (LLMs) understand code: their heavy reliance on identifier naming (the *intent channel*) rather than structural semantics (the *behavior channel*). Experiments show that removing identifier names causes severe degradation on intent-oriented tasks (e.g., code summarization), reducing outputs to line-by-line descriptions; surprisingly, performance also drops significantly on execution-oriented tasks, suggesting that existing benchmarks are contaminated by spurious naming-pattern memorization. To address this, the authors propose semantics-preserving code obfuscation and introduce **ClassEval-Obf**, a de-biased evaluation benchmark that systematically suppresses identifier leakage and memorization shortcuts. ClassEval-Obf substantially reduces the inflated scores of state-of-the-art models on both summarization and execution tasks, enabling a more reliable assessment of genuine code comprehension. The benchmark thus offers a reproducible, bias-resistant standard for evaluating code understanding in LLMs.
📝 Abstract
Large Language Models (LLMs) achieve strong results on code tasks, but how they derive program meaning remains unclear. We argue that code communicates through two channels: structural semantics, which define formal behavior, and human-interpretable naming, which conveys intent. Removing the naming channel severely degrades intent-level tasks such as summarization, where models regress to line-by-line descriptions. Surprisingly, we also observe consistent reductions on execution tasks that should depend only on structure, revealing that current benchmarks reward memorization of naming patterns rather than genuine semantic reasoning. To disentangle these effects, we introduce a suite of semantics-preserving obfuscations and show that they expose identifier leakage across both summarization and execution. Building on these insights, we release ClassEval-Obf, an obfuscation-enhanced benchmark that systematically suppresses naming cues while preserving behavior. Our results demonstrate that ClassEval-Obf reduces inflated performance gaps, weakens memorization shortcuts, and provides a more reliable basis for assessing LLMs' code understanding and generalization.
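To make the idea of semantics-preserving obfuscation concrete, here is a minimal, hypothetical sketch (not the authors' actual tooling) of how identifier names can be stripped from Python code while leaving its behavior untouched. It renames function names, arguments, and local variables to opaque placeholders via the standard-library `ast` module; built-ins like `sum` and `len` are left alone.

```python
import ast

class IdentifierObfuscator(ast.NodeTransformer):
    """Rename function names, arguments, and locally bound variables to
    opaque placeholders (f0, v1, ...) while preserving program behavior.
    Names never bound locally (e.g., builtins) are left unchanged."""

    def __init__(self):
        self.mapping = {}  # original name -> opaque name

    def _opaque(self, name, prefix):
        if name not in self.mapping:
            self.mapping[name] = f"{prefix}{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._opaque(node.name, "f")
        self.generic_visit(node)  # args are visited before the body
        return node

    def visit_arg(self, node):
        node.arg = self._opaque(node.arg, "v")
        return node

    def visit_Name(self, node):
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        elif isinstance(node.ctx, ast.Store):
            # First binding of a local variable: assign a fresh placeholder.
            node.id = self._opaque(node.id, "v")
        return node

src = """
def average_grade(scores):
    total = sum(scores)
    return total / len(scores)
"""

tree = IdentifierObfuscator().visit(ast.parse(src))
obfuscated = ast.unparse(tree)  # requires Python 3.9+
print(obfuscated)
```

The obfuscated function computes the same value as the original, but the intent channel ("average", "grade", "scores") is gone; a model that truly reasons over structure should summarize and execute it just as well, which is the property the paper's benchmark tests. (A full tool would also need to handle scoping, attributes, imports, and string literals.)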