Human-Aligned Code Readability Assessment with Large Language Models

📅 2025-10-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional static metrics fail to capture the human subjectivity and context dependence inherent in code readability assessment. Method: We introduce CoReEval, the first large-scale, developer-centric benchmark for LLM-based code readability evaluation, covering 10 mainstream LLMs, three programming languages (Java, Python, CUDA), four prompting strategies (zero-shot, few-shot, chain-of-thought, tree-of-thought), and varied decoding configurations. We propose a developer-oriented prompting framework with role-based persona modeling to enable interpretable, lightweight personalized evaluation, and integrate multi-dimensional validation via static-metric correlation, sentiment analysis, semantic clustering, and relevance scoring. Results: Prompting grounded in human-defined readability dimensions improves alignment with human judgments and explanation quality; experiments further uncover an intrinsic trade-off between model alignment and score stability. CoReEval establishes both theoretical foundations and practical paradigms for LLM-driven code review.

📝 Abstract
Code readability is crucial for software comprehension and maintenance, yet difficult to assess at scale. Traditional static metrics often fail to capture the subjective, context-sensitive nature of human judgments. Large Language Models (LLMs) offer a scalable alternative, but their behavior as readability evaluators remains underexplored. We introduce CoReEval, the first large-scale benchmark for evaluating LLM-based code readability assessment, comprising over 1.4 million model-snippet-prompt evaluations across 10 state-of-the-art LLMs. The benchmark spans 3 programming languages (Java, Python, CUDA), 2 code types (functional code and unit tests), 4 prompting strategies (ZSL, FSL, CoT, ToT), 9 decoding settings, and developer-guided prompts tailored to junior and senior personas. We compare LLM outputs against human annotations and a validated static model, analyzing numerical alignment (MAE, Pearson's r, Spearman's ρ) and justification quality (sentiment, aspect coverage, semantic clustering). Our findings show that developer-guided prompting grounded in human-defined readability dimensions improves alignment in structured contexts, enhances explanation quality, and enables lightweight personalization through persona framing. However, increased score variability highlights trade-offs between alignment, stability, and interpretability. CoReEval provides a robust foundation for prompt engineering, model alignment studies, and human-in-the-loop evaluation, with applications in education, onboarding, and CI/CD pipelines where LLMs can serve as explainable, adaptable reviewers.
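The abstract's numerical-alignment analysis compares LLM scores to human ratings via MAE, Pearson's r, and Spearman's ρ. A minimal self-contained sketch of those three metrics is below; the score vectors are invented for demonstration and are not data from the paper.

```python
# Illustrative computation of CoReEval-style alignment metrics (MAE,
# Pearson's r, Spearman's rho) between hypothetical human readability
# ratings and LLM-assigned scores. Pure stdlib; no real benchmark data.
from statistics import mean

def mae(a, b):
    """Mean absolute error between two equal-length score lists."""
    return mean(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation coefficient."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def ranks(v):
    """Assign ranks (1-based), averaging ranks over ties."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    return pearson(ranks(a), ranks(b))

human = [4, 3, 5, 2, 4, 1]   # invented human readability ratings (1-5)
llm   = [4, 3, 4, 2, 5, 2]   # invented LLM-assigned scores

print(round(mae(human, llm), 3))       # → 0.5
print(round(pearson(human, llm), 3))   # → 0.86
print(round(spearman(human, llm), 3))  # → 0.851
```

Spearman's ρ is computed as Pearson's r over rank vectors (with tie-averaged ranks), which is why the two correlations diverge when ties or nonlinear monotone relationships are present.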
Problem

Research questions and friction points this paper is trying to address.

Assessing code readability at scale using Large Language Models
Evaluating alignment between LLM assessments and human judgments
Developing benchmark for LLM-based code readability evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developer-guided prompting improves human alignment
CoReEval benchmark enables scalable readability evaluation
LLMs serve as explainable, adaptable code reviewers
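The developer-guided prompting idea pairs a persona (junior vs. senior developer) with human-defined readability dimensions. A hypothetical sketch of how such a prompt could be assembled is shown below; the persona wording, dimension list, and `build_prompt` helper are illustrative assumptions, not the paper's actual prompts.

```python
# Hypothetical persona-framed, dimension-guided prompt builder in the spirit
# of CoReEval's developer-guided strategy. All strings here are invented
# illustrations, not the benchmark's real templates.

READABILITY_DIMENSIONS = [
    "naming clarity",
    "comment quality",
    "control-flow simplicity",
    "consistent formatting",
]

def build_prompt(code: str, persona: str = "senior") -> str:
    """Assemble a readability-rating prompt framed by a developer persona."""
    role = {
        "junior": "a junior developer who recently joined the team",
        "senior": "a senior developer with 10+ years of code-review experience",
    }[persona]
    dims = "\n".join(f"- {d}" for d in READABILITY_DIMENSIONS)
    return (
        f"You are {role}.\n"
        f"Rate the readability of the code below on a 1-5 scale, "
        f"considering these dimensions:\n{dims}\n"
        f"Justify your score in 2-3 sentences.\n\n"
        f"```\n{code}\n```"
    )

print(build_prompt("def f(x):\n    return x * 2", persona="junior"))
```

Swapping the persona changes only the framing line, which is what makes this style of personalization lightweight: the scoring rubric (the dimension list) stays fixed while the evaluator's assumed experience level varies.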