🤖 AI Summary
This study reveals systematic biases in mainstream code large language models (Code LLMs) with respect to software design patterns: average pattern identification accuracy falls below 58%, and only ~32% of generated code satisfies both the semantic and structural constraints of the target pattern, compromising the reliability of downstream tasks. To address this, the authors introduce the first comprehensive design pattern benchmark covering three core capabilities (identification, comprehension, and generation) and propose a multidimensional evaluation framework integrating prompt engineering, expert annotation, and statistical analysis. Through cross-model and cross-pattern controlled experiments, the work provides the first systematic diagnosis of Code LLMs' deficiencies at the design-paradigm level, establishing a reproducible methodological foundation and empirical evidence for rigorous model assessment, prompt optimization, and domain-aligned fine-tuning.
📝 Abstract
Code Large Language Models (LLMs) demonstrate great versatility in adapting to various downstream tasks, including code generation and completion, as well as bug detection and fixing. However, Code LLMs often fail to capture existing coding standards, generating code that conflicts with a project's required design patterns. As a result, developers must post-process the generated code to align it with the project's design norms. In this work, we empirically investigate the biases of Code LLMs in software development. Through carefully designed experiments, we assess the models' understanding of design patterns across three dimensions: recognition, comprehension, and generation. Our findings reveal that biases in Code LLMs significantly affect the reliability of downstream tasks.