A Note on Code Quality Score: LLMs for Maintainable Large Codebases

📅 2025-08-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address code quality assurance in large-scale collaborative development, this paper proposes the Code Quality Score (CQS) system, built on two fine-tuned Llama3 models. Both models are trained with supervised fine-tuning (SFT) and offline reinforcement learning: one detects common code quality issues in a set of code changes, and the other critiques LLM-generated code review. A layer of interpretable, hand-crafted rules then filters out incorrect responses and hallucinations, keeping the review feedback accurate and actionable. The method enables automated code review on industrial-scale repositories, achieving an offline F1 score of 92.3%. After deployment, it has consistently achieved a week-over-week user helpfulness rate of 60%. The core contribution is a lightweight, controllable review paradigm that combines data-driven modeling with human-crafted, interpretable constraints.

📝 Abstract

Maintaining code quality in large-scale software systems presents significant challenges, particularly in settings where large numbers of engineers work concurrently on a codebase. This paper introduces the Code Quality Score (CQS) system, which automatically detects issues in a set of code changes and provides actionable insights. At its core, the CQS system is powered by two Llama3 models, fine-tuned with SFT and offline RL approaches, to a) detect common code quality issues related to coding best practices and b) provide good "critiques" for LLM-generated code review, respectively. To maintain a good user experience, we layer the system with hand-crafted rules to filter out incorrect responses and hallucinations. Offline evaluations show that our CQS system achieves an impressive precision rate for identifying valid issues. The system has already been rolled out to developers in an industrial-scale setting and has consistently achieved a 60% week-over-week user helpfulness rate, demonstrating its effectiveness in a real-world environment. In this paper, we present details of the CQS system along with some learnings on curating developer feedback to create training data for LLM fine-tuning.
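
The abstract does not define how the week-over-week helpfulness rate is computed. A minimal sketch, assuming it is the share of developer feedback responses in a given week that mark a CQS comment as helpful; the function name `weekly_helpfulness` and the `(week, was_helpful)` record shape are hypothetical, not taken from the paper:

```python
from collections import defaultdict


def weekly_helpfulness(feedback):
    """Return {week: helpful_share} from (week_label, was_helpful) pairs.

    Assumption: the rate is helpful responses divided by all responses
    received in the same week. The paper may define it differently.
    """
    helpful = defaultdict(int)
    total = defaultdict(int)
    for week, was_helpful in feedback:
        total[week] += 1
        if was_helpful:
            helpful[week] += 1
    # Only weeks with at least one response appear, so no division by zero.
    return {week: helpful[week] / total[week] for week in total}
```

Tracking the value per week, rather than as a single pooled figure, is what makes a claim like "consistently 60%" checkable: a drop in any one week stays visible instead of being averaged away.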

Problem

Research questions and friction points this paper is trying to address.

Automate code quality issue detection in large codebases

Provide actionable insights for code review critiques

Filter incorrect responses to maintain user experience

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned Llama3 models for code quality detection

Hand-crafted rules to filter incorrect responses

Offline RL and SFT for model fine-tuning
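
The hand-crafted filtering layer can be pictured as a list of small predicates applied to each model-reported issue. The sketch below is illustrative only: the paper's actual rules are not given here, so the `Issue` type, both rules, and the 10-character threshold are assumptions chosen to show the pattern.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Set


@dataclass(frozen=True)
class Issue:
    """A model-reported issue (hypothetical shape, not the paper's schema)."""
    file_path: str
    line: int
    message: str


# Maps each changed file to the set of line numbers touched by the change.
ChangedLines = Dict[str, Set[int]]
# A rule returns True when the issue should be kept.
Rule = Callable[[Issue, ChangedLines], bool]


def line_in_diff(issue: Issue, changed: ChangedLines) -> bool:
    # Assumed rule: drop issues that point at lines the change never touched,
    # a common symptom of a hallucinated location.
    return issue.line in changed.get(issue.file_path, set())


def has_substantive_message(issue: Issue, changed: ChangedLines) -> bool:
    # Assumed rule: drop empty or near-empty comments (threshold is arbitrary).
    return len(issue.message.strip()) >= 10


DEFAULT_RULES: Sequence[Rule] = (line_in_diff, has_substantive_message)


def filter_issues(issues: List[Issue], changed: ChangedLines,
                  rules: Sequence[Rule] = DEFAULT_RULES) -> List[Issue]:
    """Keep only the issues that pass every rule."""
    return [i for i in issues if all(rule(i, changed) for rule in rules)]
```

Keeping each rule as a separate named function is one way to preserve the interpretability the summary emphasizes: when a comment is suppressed, the rule that rejected it can be identified and adjusted without retraining either model.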
