On the Reliability of Code Comprehension Proxies

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the unknown relative reliability of the proxy metrics used in code comprehension research. The authors adapt the Delphi expert-consensus protocol, in what is to their knowledge its first application in software engineering, to have a panel of five professional software engineers establish a "ground truth" comprehensibility ranking of eight code snippets. This benchmark is then used to evaluate 14 proxy metrics measured on the same snippets in a study with 44 student participants. A correlation analysis finds that proxies derived from input-output questions, and proxies that measure response time rather than accuracy, are especially reliable, whereas proxies derived from questions about program syntax are especially unreliable regardless of measurement strategy. These results call into question the parts of the prior literature that rely on syntactic proxies. The work thus provides both a methodological foundation and empirical evidence to guide future research on code comprehension.

📝 Abstract

Prior work on code comprehension uses different comprehension proxies (for example, Likert-scale ratings or answers to input-output questions about program snippets, usually collected from students) to approximate whether code is comprehensible to software engineers, but the relative reliability of these proxies is not known. This paper investigates the relative reliability of a collection of proxies common in the extant literature with a pair of human studies. First, we conducted an expert-consensus study with a panel of five professional software engineers to establish a ground-truth comprehensibility ranking of eight code snippets by adapting the Delphi expert-consensus protocol. The Delphi protocol is widely used for expert consensus under conditions of uncertainty in other domains, such as medicine and national-security forecasting, but to our knowledge, this is its first application in software engineering. Second, we conducted a study with 44 student participants who completed tasks that allowed us to measure 14 comprehension proxies derived from the literature on the same set of eight code snippets. Finally, we conducted a correlation analysis on the results, concluding that two kinds of proxies are especially reliable: 1) those derived from input-output questions and 2) those that measure response time rather than accuracy. We also found that proxies derived from questions about program syntax (rather than semantics) are especially unreliable, regardless of measurement strategy, which calls into question the reliability of parts of the existing comprehensibility literature.
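
The correlation analysis described above can be pictured with a small sketch. The abstract does not say which correlation coefficient the authors used, so Spearman's rank correlation is an assumption here, chosen because the ground truth is a ranking. The function names and all numbers below are hypothetical and purely illustrative; they are not data from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical expert-consensus ranking of eight snippets
# (1 = most comprehensible, 8 = least comprehensible).
expert_rank = [1, 2, 3, 4, 5, 6, 7, 8]

# Hypothetical proxy: median student response time in seconds per snippet.
median_response_time_s = [31.0, 40.5, 38.2, 55.1, 61.7, 59.9, 80.3, 95.0]

rho = spearman(expert_rank, median_response_time_s)
print(f"Spearman rho between proxy and expert ranking: {rho:.3f}")
```

Under this reading, a proxy counts as more reliable the more closely its per-snippet ordering agrees with the expert ranking, that is, the closer the magnitude of rho is to 1.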

Problem

Research questions and friction points this paper is trying to address.

code comprehension

comprehension proxies

reliability

software engineering

human study

Innovation

Methods, ideas, or system contributions that make the work stand out.

Delphi protocol

code comprehension

comprehension proxies