🤖 AI Summary
This work investigates the robustness of large language models (LLMs) in reasoning about core software design principles—module cohesion and coupling—particularly under noise interference and in open-ended scenarios. We propose a hierarchical evaluation framework integrating programmatically generated poorly designed code, multi-stage tasks (verification, guidance, open-ended generation), and reasoning-trace analysis to quantify F1-score degradation. Experiments reveal that while LLMs exhibit a baseline understanding of design under ideal conditions, performance deteriorates markedly in open and noisy settings: coupling reasoning suffers an F1 decline of over 50%, whereas cohesion analysis remains relatively stable. Further analysis uncovers intrinsic limitations, including reliance on cognitive shortcuts and neglect of structural constraints. To our knowledge, this is the first systematic characterization of LLMs' reasoning fragility at the level of software design principles, establishing both theoretical foundations and empirical benchmarks for trustworthy AI-assisted design.
📝 Abstract
Large language models (LLMs) are increasingly being adopted in the software engineering domain, yet the robustness of their grasp of core software design concepts remains unclear. We conduct an empirical study to systematically evaluate their understanding of cohesion (intra-module) and coupling (inter-module). We programmatically generate poorly designed code fragments and test the DeepSeek-R1 model family (14B, 32B, 70B) under varying levels of guidance, from simple *Verification* to *Guided* and *Open-ended Generation* tasks, while varying contextual noise by injecting distractor elements. While the models exhibit a solid baseline understanding of both concepts under ideal conditions, their practical knowledge is fragile and highly asymmetrical. Reasoning about coupling proves brittle: performance collapses in noisy, open-ended scenarios, with F1 scores dropping by over 50%. In contrast, the models' analysis of cohesion is remarkably robust to internal noise in guided tasks, showing little performance degradation. However, this resilience also fails when all guidance is removed. Reasoning-trace analysis confirms these failure modes, revealing *cognitive shortcutting* for coupling versus a more exhaustive (yet still failing) analysis for cohesion. In summary, while LLMs can provide reliable assistance in recognizing design flaws, their ability to reason autonomously in noisy, realistic contexts is limited, highlighting the critical need for more scalable and robust program-understanding capabilities.
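To make the reported asymmetry concrete, the F1 degradation can be sketched with a few lines of Python. The numbers below are hypothetical illustrations of the pattern described above (coupling collapses under noise, cohesion stays stable), not the paper's actual measurements:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical precision/recall values illustrating the two regimes.
coupling_clean = f1(precision=0.85, recall=0.80)  # ideal conditions
coupling_noisy = f1(precision=0.42, recall=0.35)  # noisy, open-ended
cohesion_clean = f1(precision=0.82, recall=0.78)
cohesion_noisy = f1(precision=0.79, recall=0.74)

# Relative F1 drop for each concept.
coupling_drop = 1 - coupling_noisy / coupling_clean  # > 50% collapse
cohesion_drop = 1 - cohesion_noisy / cohesion_clean  # small degradation
```

With these illustrative inputs, `coupling_drop` exceeds 0.5 while `cohesion_drop` stays under 0.1, mirroring the "brittle coupling, robust cohesion" finding.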