Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning

📅 2025-05-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates how large language models (LLMs) use long-context information during code reasoning, distinguishing lexical code recall (verbatim retrieval) from semantic code recall (remembering what the code does). To measure semantic recall, the authors introduce SemTrace, a code reasoning technique in which the impact of each statement on the output is attributable and unpredictable, and they propose a method for quantifying the semantic recall sensitivity of existing benchmarks. Empirical analysis of state-of-the-art LLMs shows: (1) code reasoning accuracy drops significantly when the relevant snippet sits near the middle of a long input, especially for techniques demanding high semantic recall like SemTrace; (2) lexical recall depends on granularity, with models strong at function-level retrieval but weak at line-by-line recall; and (3) lexical and semantic recall appear disconnected, suggesting different underlying mechanisms. The authors further find that current code reasoning benchmarks may have low semantic recall sensitivity, potentially underestimating the difficulty LLMs face in leveraging in-context information.

📝 Abstract
Although modern Large Language Models (LLMs) support extremely large contexts, their effectiveness in utilizing long context for code reasoning remains unclear. This paper investigates LLM reasoning ability over code snippets within large repositories and how it relates to their recall ability. Specifically, we differentiate between lexical code recall (verbatim retrieval) and semantic code recall (remembering what the code does). To measure semantic recall, we propose SemTrace, a code reasoning technique where the impact of specific statements on output is attributable and unpredictable. We also present a method to quantify semantic recall sensitivity in existing benchmarks. Our evaluation of state-of-the-art LLMs reveals a significant drop in code reasoning accuracy as a code snippet approaches the middle of the input context, particularly with techniques requiring high semantic recall like SemTrace. Moreover, we find that lexical recall varies by granularity, with models excelling at function retrieval but struggling with line-by-line recall. Notably, a disconnect exists between lexical and semantic recall, suggesting different underlying mechanisms. Finally, our findings indicate that current code reasoning benchmarks may exhibit low semantic recall sensitivity, potentially underestimating LLM challenges in leveraging in-context information.
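To make the SemTrace idea concrete, here is a minimal sketch of how such a probe might be generated: a small function is synthesized so that every statement mixes a fresh random constant into the result, making each line's contribution to the output both attributable (dropping any line changes the answer) and unpredictable (the answer cannot be guessed without tracing the code). The function name `make_semtrace_probe` and the exact construction are illustrative assumptions, not the paper's implementation.

```python
import random

def make_semtrace_probe(n_steps=4, seed=0):
    """Generate a SemTrace-style probe (hypothetical sketch).

    Returns the source of a small function plus its ground-truth
    output for input 1, computed by actually executing the code.
    """
    rng = random.Random(seed)
    consts = [rng.randint(2, 9) for _ in range(n_steps)]
    lines = ["def probe(x):"]
    for i, c in enumerate(consts):
        # Each statement folds in a distinct random constant, so
        # skipping or misreading any one line changes the output.
        lines.append(f"    x = x * {c} + {i}")
    lines.append("    return x")
    src = "\n".join(lines)
    ns = {}
    exec(src, ns)          # execute the generated code for ground truth
    expected = ns["probe"](1)
    return src, expected
```

A model asked "what does `probe(1)` return?" must semantically trace every statement; verbatim (lexical) recall of the snippet alone is not enough to produce the number.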
Problem

Research questions and friction points this paper is trying to address.

Investigates LLM code reasoning ability in large repositories
Measures semantic recall impact on code reasoning accuracy
Evaluates disconnect between lexical and semantic recall mechanisms
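The position-dependence question above can be probed with a simple harness that embeds the target snippet at a chosen relative depth inside a long concatenated context, then sweeps the depth from start to end. The function name and prompt wording here are illustrative assumptions, not the paper's exact setup.

```python
def build_long_context_prompt(snippet, filler_files, depth=0.5):
    """Embed `snippet` at relative position `depth` (0.0 = start,
    1.0 = end) among filler code files, to test whether reasoning
    accuracy dips when the snippet sits mid-context.
    Hypothetical harness; names are illustrative.
    """
    k = round(depth * len(filler_files))
    parts = filler_files[:k] + [snippet] + filler_files[k:]
    context = "\n\n".join(parts)
    return context + "\n\nWhat does `probe(1)` return?"
```

Sweeping `depth` over, say, `[0.0, 0.25, 0.5, 0.75, 1.0]` and scoring the model's answers at each position would reproduce the kind of mid-context accuracy drop the paper reports.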
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes SemTrace for semantic code recall measurement
Quantifies semantic recall sensitivity in benchmarks
Evaluates LLM code reasoning accuracy with context position
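One simple proxy for the benchmark-sensitivity idea above: a benchmark item requires semantic recall only if a small semantics-altering edit to its code changes the correct answer. The metric below, the fraction of items whose ground truth flips under mutation, is an illustrative sketch under that assumption, not the paper's exact formulation; `mutate` and `solve` are hypothetical callbacks.

```python
def semantic_sensitivity(items, mutate, solve):
    """Fraction of benchmark items whose ground-truth answer changes
    under a small semantics-altering mutation of the code.

    items  -- list of code snippets (the benchmark)
    mutate -- edits a snippet so its behavior should differ
    solve  -- computes the ground-truth answer for a snippet
    Hypothetical proxy metric, for illustration only.
    """
    changed = sum(1 for code in items if solve(mutate(code)) != solve(code))
    return changed / len(items)
```

A benchmark scoring near 0.0 could be answered correctly without ever tracing the code's semantics, which is exactly the failure mode the paper warns existing benchmarks may have.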