🤖 AI Summary
To address the challenge of code quality assurance in large-scale collaborative development, this paper proposes the Code Quality Score (CQS) system, a code review method built on two Llama3 models. Both models are fine-tuned with supervised fine-tuning (SFT) and offline reinforcement learning: one identifies potential defects in code changes, and the other critiques the LLM-generated review. An interpretable, rule-based filter then removes incorrect responses and hallucinations so that the review feedback stays accurate and actionable. The method enables end-to-end automated code review on industrial-scale repositories and reports an offline F1 score of 92.3%. After deployment, it has consistently achieved a week-over-week user helpfulness rate of 60%. The core contribution is a lightweight, controllable review paradigm that combines data-driven modeling with hand-crafted, interpretable constraints.
📝 Abstract
Maintaining code quality in large-scale software systems presents significant challenges, particularly in settings where large numbers of engineers work concurrently on a codebase. This paper introduces the Code Quality Score (CQS) system, which automatically detects issues in a set of code changes and provides actionable insights. At its core, the CQS system is powered by two Llama3 models, fine-tuned with SFT and offline RL approaches, to (a) detect common code quality issues related to coding best practices and (b) provide good "critiques" of LLM-generated code reviews, respectively. To maintain a good user experience, we layer the system with hand-crafted rules that filter out incorrect responses and hallucinations. Offline evaluations show that the CQS system achieves an impressive precision rate in identifying valid issues. The system has already been rolled out to developers in an industrial-scale setting and has consistently achieved a 60% week-over-week user helpfulness rate, demonstrating its effectiveness in a real-world environment. In this paper, we present details of the CQS system along with lessons learned from curating developer feedback to create training data for LLM fine-tuning.
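The layering described above (an issue detector, a critic, then hand-crafted rules) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the CQS implementation: the `Issue` type, the `review` function, the two example rules, and the stub functions that stand in for the two fine-tuned Llama3 models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Issue:
    """Hypothetical shape of one review finding."""
    file: str
    line: int
    message: str


# Hypothetical interfaces; in a real system each would wrap a model call.
Detector = Callable[[str], List[Issue]]   # diff -> candidate issues
Critic = Callable[[str, Issue], bool]     # (diff, issue) -> keep the issue?
Rule = Callable[[str, Issue], bool]       # (diff, issue) -> passes the rule?


def has_message(diff: str, issue: Issue) -> bool:
    # Illustrative rule: drop findings with an empty explanation.
    return bool(issue.message.strip())


def line_in_range(diff: str, issue: Issue) -> bool:
    # Illustrative rule: drop findings that point outside the change.
    return 1 <= issue.line <= len(diff.splitlines())


def review(diff: str, detect: Detector, critique: Critic,
           rules: List[Rule]) -> List[Issue]:
    """Detect candidates, keep those the critic accepts, then apply all rules."""
    candidates = detect(diff)
    accepted = [i for i in candidates if critique(diff, i)]
    return [i for i in accepted if all(rule(diff, i) for rule in rules)]


# Stubs standing in for the two fine-tuned models (assumptions, not real models).
def stub_detect(diff: str) -> List[Issue]:
    return [
        Issue("app.py", 2, "Function is missing a docstring."),
        Issue("app.py", 99, "Unused import."),       # simulated hallucinated line
        Issue("app.py", 1, ""),                      # simulated empty response
        Issue("app.py", 1, "nit: rename variable"),  # rejected by the stub critic
    ]


def stub_critique(diff: str, issue: Issue) -> bool:
    return not issue.message.lower().startswith("nit")


DIFF = "def f():\n    return 1\n"
kept = review(DIFF, stub_detect, stub_critique, [has_message, line_in_range])
for issue in kept:
    print(f"{issue.file}:{issue.line}: {issue.message}")
```

Writing the rules as plain predicates keeps the final filter interpretable: each discarded finding can be traced to the specific rule that rejected it, independently of either model.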