🤖 AI Summary
Automated pull request (PR) review faces three main challenges: noisy supervision signals, limited contextual understanding, and inadequate evaluation metrics. This work proposes Sphinx, a unified framework that improves the contextual awareness and technical accuracy of large language models in PR review through structured comment generation, a checklist-based evaluation benchmark grounded in actionable verification points, and a training paradigm termed Checklist Reward Policy Optimization (CRPO). Sphinx constructs its training data by contrasting pseudo-modified and merged code versions, uses rule-based rewards for policy optimization, and introduces a coverage-oriented evaluation protocol. Experiments show that models trained with Sphinx achieve state-of-the-art review completeness and precision, improving checklist coverage by up to 40% over existing open- and closed-source baselines.
📝 Abstract
Pull request (PR) review is essential for ensuring software quality, yet automating this task remains challenging due to noisy supervision, limited contextual understanding, and inadequate evaluation metrics. We present Sphinx, a unified framework for LLM-based PR review that addresses these limitations through three key components: (1) a structured data generation pipeline that produces context-rich, semantically grounded review comments by comparing pseudo-modified and merged code; (2) a checklist-based evaluation benchmark that assesses review quality based on structured coverage of actionable verification points, moving beyond surface-level metrics like BLEU; and (3) Checklist Reward Policy Optimization (CRPO), a novel training paradigm that uses rule-based, interpretable rewards to align model behavior with real-world review practices. Extensive experiments show that models trained with Sphinx achieve state-of-the-art performance on review completeness and precision, outperforming both proprietary and open-source baselines by up to 40% in checklist coverage. Together, these components enable the development of PR review models that are not only fluent but also context-aware, technically precise, and practically deployable in real-world development workflows. The data will be released after review.
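To make the idea of checklist coverage concrete, the following is a minimal sketch and not the Sphinx implementation. It assumes each verification point can be approximated by a set of keywords and that a review covers a point when all of its keywords appear in the review text. The abstract does not specify how points are matched, so `ChecklistItem`, `item_covered`, `checklist_coverage`, and the keyword rule are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    """One actionable verification point (hypothetical representation)."""
    description: str
    keywords: tuple[str, ...]  # assumed proxy for "the review addresses this point"


def item_covered(review: str, item: ChecklistItem) -> bool:
    # Assumption: a point counts as covered when every keyword occurs in the review.
    text = review.lower()
    return all(kw.lower() in text for kw in item.keywords)


def checklist_coverage(review: str, checklist: list[ChecklistItem]) -> float:
    # Fraction of checklist items the review covers; 0.0 for an empty checklist.
    if not checklist:
        return 0.0
    hits = sum(item_covered(review, item) for item in checklist)
    return hits / len(checklist)


checklist = [
    ChecklistItem("Null check before dereferencing user", ("null", "user")),
    ChecklistItem("Missing unit test for retry path", ("test", "retry")),
]
review = "Please add a null check on user before calling user.name."
score = checklist_coverage(review, checklist)  # 0.5: first item covered, second not
```

Because a score of this kind is computed by a fixed rule rather than by n-gram overlap with a reference comment, it could in principle serve as both an evaluation metric and an interpretable reward signal, which is the role the abstract assigns to the checklist in CRPO. A real matcher would likely need something more robust than keyword containment.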