Sphinx: Benchmarking and Modeling for LLM-Driven Pull Request Review

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automated code pull request (PR) review faces significant challenges, including noisy supervision signals, limited contextual understanding, and inadequate evaluation metrics. This work proposes Sphinx, a unified framework that enhances the contextual awareness and technical accuracy of large language models in PR review through structured comment generation, a checklist-based evaluation benchmark grounded in actionable verification points, and a novel training paradigm termed Checklist Reward Policy Optimization (CRPO). Sphinx constructs high-quality training data by contrasting pseudo-modified and merged code versions, integrates rule-based reward mechanisms for policy optimization, and introduces a coverage-oriented evaluation protocol. Experimental results demonstrate that Sphinx achieves state-of-the-art performance in both review completeness and precision, improving checklist coverage by up to 40% over existing open- and closed-source baselines.
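The summary describes a coverage-oriented evaluation protocol: a generated review is scored by how many actionable checklist verification points it addresses. The paper's actual matching procedure is not specified here; the sketch below is a minimal illustration using a naive substring match, with the `covers` helper and the example checklist invented for demonstration.

```python
# Hypothetical sketch of a checklist-coverage score. The real Sphinx
# benchmark's item-matching logic is not described in this summary;
# substring matching is a stand-in assumption.

def covers(review: str, item: str) -> bool:
    """Naive proxy: an item counts as covered if its key phrase appears."""
    return item.lower() in review.lower()

def checklist_coverage(review: str, checklist: list[str]) -> float:
    """Fraction of checklist verification points addressed by the review."""
    if not checklist:
        return 0.0
    hit = sum(covers(review, item) for item in checklist)
    return hit / len(checklist)

checklist = [
    "null check before dereference",
    "error handling for the network call",
    "unit test for the new branch",
]
review = ("Please add a null check before dereference "
          "and a unit test for the new branch.")
print(checklist_coverage(review, checklist))  # 2 of 3 items covered
```

A coverage score like this rewards completeness directly, unlike surface-level overlap metrics such as BLEU, which the abstract explicitly moves beyond.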

📝 Abstract
Pull request (PR) review is essential for ensuring software quality, yet automating this task remains challenging due to noisy supervision, limited contextual understanding, and inadequate evaluation metrics. We present Sphinx, a unified framework for LLM-based PR review that addresses these limitations through three key components: (1) a structured data generation pipeline that produces context-rich, semantically grounded review comments by comparing pseudo-modified and merged code; (2) a checklist-based evaluation benchmark that assesses review quality based on structured coverage of actionable verification points, moving beyond surface-level metrics like BLEU; and (3) Checklist Reward Policy Optimization (CRPO), a novel training paradigm that uses rule-based, interpretable rewards to align model behavior with real-world review practices. Extensive experiments show that models trained with Sphinx achieve state-of-the-art performance on review completeness and precision, outperforming both proprietary and open-source baselines by up to 40% in checklist coverage. Together, these components enable the development of PR review models that are not only fluent but also context-aware, technically precise, and practically deployable in real-world development workflows. The data will be released after review.
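The abstract states that CRPO optimizes the model with rule-based, interpretable rewards tied to checklist items. The exact reward function is not given in this listing, so the following is only one plausible shape, assuming a reward that adds checklist coverage and subtracts a precision penalty for comments matching no item; `match`, `alpha`, and `beta` are illustrative names, not from the paper.

```python
# Illustrative rule-based reward in the spirit of CRPO (assumed form, not
# the paper's actual reward): reward coverage of checklist items, penalize
# spurious comments that address no item.

def crpo_style_reward(comments, checklist, match, alpha=1.0, beta=0.5):
    """comments: generated review comments; checklist: verification points;
    match(comment, item) -> bool decides whether a comment addresses an item.
    alpha/beta are assumed weights for coverage vs. precision."""
    if checklist:
        covered = sum(
            1 for item in checklist
            if any(match(c, item) for c in comments)
        )
        coverage = covered / len(checklist)
    else:
        coverage = 0.0
    if comments:
        spurious = sum(
            1 for c in comments
            if not any(match(c, item) for item in checklist)
        )
        precision_pen = spurious / len(comments)
    else:
        precision_pen = 0.0
    return alpha * coverage - beta * precision_pen
```

Because every term is a simple rule over checklist matches, the resulting reward is interpretable: a trained model can be told exactly which items it missed and which comments earned the penalty.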
Problem

Research questions and friction points this paper is trying to address.

pull request review
noisy supervision
contextual understanding
evaluation metrics
code review automation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based code review
structured data generation
checklist-based evaluation
reward policy optimization
context-aware modeling
Daoan Zhang (PhD Student, University of Rochester; Computer Vision, Multimodal Learning, LLM)
Shuo Zhang (Microsoft)
Zijian Jin (New York University; NLP)
Jiebo Luo (University of Rochester)
Shengyu Fu (Microsoft)
Elsie Nallipogu (Microsoft)