AI Summary
This work addresses a critical gap in current code reasoning research, which focuses predominantly on isolated code snippets and overlooks the challenges posed by external API calls and unseen functions in real-world scenarios. To bridge this gap, we introduce CodeGlance, a multi-dimensional benchmark that systematically evaluates seven state-of-the-art large language models across three realistic dimensions: intrinsic logic, API interaction, and reasoning over unseen functions. Our evaluation reveals, for the first time, a pronounced weakness of smaller models on unseen functions: Qwen2.5-3B, for example, achieves only 6.0% accuracy. We quantify how complexity factors such as execution trace length and API call frequency affect reasoning difficulty, and through ablation studies of enhancement strategies including chain-of-thought prompting, documentation retrieval, and code search, we show that the efficacy of each strategy depends heavily on the specific type of challenge, offering actionable insights for practical deployment.
Abstract
In modern software development, developers frequently need to understand code behavior at a glance, whether reviewing pull requests, debugging issues, or navigating unfamiliar codebases. This ability to reason about dynamic program behavior is fundamental to effective software engineering and is increasingly supported by Large Language Models (LLMs). However, existing studies on code reasoning focus primarily on isolated code snippets, overlooking the complexity of real-world scenarios involving external API interactions and unfamiliar functions. This gap hinders our understanding of what truly makes code reasoning challenging for LLMs across diverse programming contexts. We present CodeGlance, a multi-dimensional benchmark that investigates code reasoning challenges across three realistic scenarios: intrinsic logic reasoning, API interaction reasoning, and unseen function reasoning. Through a systematic evaluation of seven state-of-the-art LLMs, we show that unseen function reasoning poses significant challenges, especially for smaller models, with Qwen2.5-3B achieving only 6.0% accuracy on unseen functions compared to 37.5% on familiar APIs. We identify critical code complexity features, including execution trace length, API invocation count, and control flow complexity, that significantly affect code reasoning difficulty across scenarios. We further investigate how common augmentation strategies, including chain-of-thought (CoT) prompting, document retrieval, and code search, can improve reasoning performance, finding that their effectiveness varies substantially depending on whether the challenges stem from logical complexity or knowledge gaps. These findings provide actionable guidance for developing more capable code reasoning systems and for deploying LLM-based programming assistants in real-world software development.
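To make the unseen-function scenario concrete, the sketch below shows a hypothetical task of the kind the abstract describes; the helper name and values are illustrative assumptions, not items drawn from CodeGlance. The model is asked to predict the printed output of code whose behavior hinges on a project-local function it has never encountered, rather than a well-known library API.

```python
# Hypothetical "unseen function reasoning" item (illustrative only, not from CodeGlance):
# the model must trace main() and predict its output, even though normalize_scores
# is a project-specific helper absent from its pretraining data.

def normalize_scores(scores):
    """Project-local helper: rescale scores so the maximum becomes 1.0."""
    peak = max(scores)
    return [round(s / peak, 2) for s in scores]

def main():
    raw = [25, 50, 100]
    # Question posed to the model: what does this print?
    print(normalize_scores(raw))  # expected answer: [0.25, 0.5, 1.0]

if __name__ == "__main__":
    main()
```

In the familiar-API counterpart, the call would instead target a widely used library function whose semantics the model already knows, which is why the abstract contrasts accuracy on unseen functions with accuracy on familiar APIs.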