🤖 AI Summary
This study addresses a gap in research on large language models (LLMs) for programming: existing work focuses mainly on code correctness and overlooks programmers' subjective coding preferences. The authors propose four axes of such preferences (complexity, commenting, modularity, and readability), validated by 25 software engineers, and introduce a dataset of roughly 3,000 paired Python code snippets annotated with Likert-scale ratings from 73 experts. They then evaluate 13 LLMs on pairs of solutions to the same programming task, presented first as textual descriptions and then as concrete code. Models often prefer one option in natural language but the opposite in code, and the more consistent models frequently show positional bias. When the five most consistent models re-annotate the dataset, their Likert ratings are polarized and diverge notably from human judgments. A case study on GPT-5 reveals reliance on external assumptions and brittle reasoning, underscoring how difficult stylistic alignment with human programmers remains.
📝 Abstract
Large Language Models (LLMs) have become increasingly popular for coding tasks, and subjective coding preferences are an essential element in adapting them to programmers' personal needs. Existing work overlooks such preferences and focuses mainly on code correctness. In this study, we propose a typology of four subjective coding preference axes (complexity, commenting, modularity, and readability), motivated by common engineering habits and validated by 25 software engineers. We collect a dataset of ~3,000 paired Python code snippets reflecting these axes, annotated by 73 experts who rate their preferences on a Likert scale. Using our dataset, we study how LLMs handle subjective coding preferences. We present 13 LLMs with pairs of solutions to the same programming task, first as textual descriptions and then as concrete code snippets. We find that models often prefer one option in natural language but the opposite when evaluating code. More consistent models (i.e., those whose choices agree between words and deeds) frequently reveal positional bias: swapping the order of the options changes the preferred alternative. We then use the five most consistent models to re-annotate the dataset. Compared to humans, models show polarized Likert distributions and notable divergence in ratings. A case study on GPT-5 reveals reliance on external assumptions and brittle reasoning.
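The two checks described in the abstract, agreement between words and deeds and robustness to option order, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `Judge` type, the function names, and the toy judges are hypothetical stand-ins for a real LLM call that returns "A" or "B" for a pair of options.

```python
from typing import Callable, Tuple

# Hypothetical judge interface (an assumption, not from the paper):
# given two options in presentation order, return "A" (first) or "B" (second).
Judge = Callable[[str, str], str]


def is_position_robust(judge: Judge, first: str, second: str) -> bool:
    """True if the judge picks the same underlying option in both orders."""
    original = judge(first, second)   # "A" refers to `first`
    swapped = judge(second, first)    # "A" refers to `second`
    picked_original = first if original == "A" else second
    picked_swapped = second if swapped == "A" else first
    return picked_original == picked_swapped


def words_match_deeds(
    judge: Judge,
    descriptions: Tuple[str, str],
    snippets: Tuple[str, str],
) -> bool:
    """True if the judge prefers the same side for descriptions and for code."""
    return judge(*descriptions) == judge(*snippets)


# Toy judges used only to exercise the checks.
def always_first(a: str, b: str) -> str:
    """A maximally position-biased judge: always picks whatever comes first."""
    return "A"


def prefers_shorter(a: str, b: str) -> str:
    """A content-based judge: picks the shorter option regardless of order."""
    return "A" if len(a) <= len(b) else "B"
```

A judge such as `always_first` fails the order-swap check by construction, while `prefers_shorter` passes it; yet `prefers_shorter` can still fail `words_match_deeds` whenever the shorter description belongs to the longer snippet, which is the kind of words-versus-code disagreement the abstract reports.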