🤖 AI Summary
Problem: Traditional static analysis struggles to detect object-oriented semantic design flaws, particularly violations of the SOLID principles, and existing approaches are largely confined to single principles or individual programming languages.
Method: We propose the first cross-language framework for detecting violations of all five SOLID principles, leveraging large language models (LLMs) and systematic prompt engineering. We construct a multilingual benchmark comprising 240 manually validated code examples and conduct a comprehensive evaluation of CodeLlama, DeepSeek-Coder, Qwen-Coder, and GPT-4o Mini under zero-shot, few-shot, chain-of-thought, and ensemble prompting strategies.
Contribution/Results: GPT-4o Mini achieves the best overall performance, yet detecting high-abstraction violations (e.g., the Dependency Inversion Principle) remains challenging. Prompt effectiveness is strongly influenced by language-specific features and code complexity. This work bridges a critical research gap in LLM-driven, multilingual SOLID violation detection and establishes a new paradigm for semantic-level design quality assessment.
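To make the detection target concrete, here is a minimal, hypothetical sketch of the kind of high-abstraction flaw the models must recognize, a Dependency Inversion Principle violation written in Python. The class names and logic are illustrative only and are not drawn from the paper's benchmark:

```python
# Hypothetical illustration of a DIP violation (not from the paper's dataset).
# The high-level ReportService depends directly on the concrete MySQLDatabase
# class instead of an abstraction, so swapping storage backends requires
# editing the service itself.

class MySQLDatabase:
    def fetch_rows(self, query: str) -> list[dict]:
        # Stand-in for a real database query.
        return [{"id": 1, "total": 42}]


class ReportService:
    def __init__(self) -> None:
        self.db = MySQLDatabase()  # concrete dependency hard-wired here

    def build_report(self) -> str:
        rows = self.db.fetch_rows("SELECT * FROM sales")
        return f"{len(rows)} rows summarized"


# A DIP-compliant variant would inject an abstract repository interface,
# leaving ReportService unaware of the concrete storage technology.
```

Recognizing this flaw requires reasoning about abstractions and dependency direction rather than matching surface syntax, which is why such violations are hard for traditional static analyzers.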
📝 Abstract
Traditional static analysis methods struggle to detect semantic design flaws, such as violations of the SOLID principles, which require a strong understanding of object-oriented design patterns and principles. Existing solutions typically focus on individual SOLID principles or specific programming languages, leaving a gap in the ability to detect violations across all five principles in multi-language codebases. This paper presents a new approach: a methodology that leverages tailored prompt engineering to assess large language models (LLMs) on their ability to detect SOLID violations across multiple languages. We present a benchmark of four leading LLMs (CodeLlama, DeepSeekCoder, QwenCoder, and GPT-4o Mini) on their ability to detect violations of all five SOLID principles. For this evaluation, we construct a new benchmark dataset of 240 manually validated code examples. Using this dataset, we test four distinct prompt strategies inspired by established zero-shot, few-shot, and chain-of-thought techniques to systematically measure their impact on detection accuracy. Our emerging results reveal a stark hierarchy among models: GPT-4o Mini decisively outperforms the others, yet even it struggles with challenging principles such as the Dependency Inversion Principle (DIP). Crucially, we show that prompt strategy has a dramatic impact, but no single strategy is universally best; for instance, a deliberative ENSEMBLE prompt excels at Open-Closed Principle (OCP) detection, while a hint-based EXAMPLE prompt is superior for DIP violations. Across all experiments, detection accuracy is heavily influenced by language characteristics and degrades sharply with increasing code complexity. These initial findings demonstrate that effective, AI-driven design analysis requires not a single best model, but a tailored approach that matches the right model and prompt to the specific design context, highlighting the potential of LLMs to support maintainability through AI-assisted code analysis.
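For readers who want a feel for how such prompt-strategy comparisons might be wired up, the following is a minimal sketch assuming the OpenAI Python client and the gpt-4o-mini model. The prompt wording, the helper name detect_violation, and the temperature setting are illustrative assumptions, not the paper's actual prompts or evaluation harness:

```python
# Hypothetical sketch comparing zero-shot and chain-of-thought SOLID-violation
# prompts. The wording and helper names are illustrative; they are not the
# prompts or code used in the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ZERO_SHOT = (
    "You are a software design reviewer. Decide whether the following code "
    "violates any of the five SOLID principles. Answer with the principle "
    "name or 'none'.\n\n{code}"
)

CHAIN_OF_THOUGHT = (
    "You are a software design reviewer. First describe the responsibilities, "
    "abstractions, and dependencies in the code step by step, then state which "
    "SOLID principle, if any, is violated.\n\n{code}"
)

def detect_violation(code: str, prompt_template: str) -> str:
    """Send one detection prompt to GPT-4o Mini and return the raw answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt_template.format(code=code)}],
        temperature=0,  # keep outputs stable across evaluation runs
    )
    return response.choices[0].message.content

# Usage: compare detect_violation(snippet, ZERO_SHOT) against
# detect_violation(snippet, CHAIN_OF_THOUGHT) on the same labeled example.
```

In a setup like this, few-shot and ensemble variants would simply add worked examples to the prompt or aggregate answers from several prompts, keeping the evaluation loop unchanged.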