Sphinx: Benchmarking and Modeling for LLM-Driven Pull Request Review

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automated code pull request (PR) review faces significant challenges, including noisy supervision signals, limited contextual understanding, and inadequate evaluation metrics. This work proposes Sphinx, a unified framework that enhances the contextual awareness and technical accuracy of large language models in PR review through structured comment generation, a checklist-based evaluation benchmark grounded in actionable verification points, and a novel training paradigm termed Checklist Reward Policy Optimization (CRPO). Sphinx constructs high-quality training data by contrasting pseudo-modified and merged code versions, integrates rule-based reward mechanisms for policy optimization, and introduces a coverage-oriented evaluation protocol. Experimental results demonstrate that Sphinx achieves state-of-the-art performance in both review completeness and precision, improving checklist coverage by up to 40% over existing open- and closed-source baselines.
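The summary describes a coverage-oriented evaluation protocol: a generated review is scored by how many actionable checklist verification points it addresses. The paper's actual matching procedure is not specified here; the sketch below is a minimal illustration using a naive substring match, with the `covers` helper and the example checklist invented for demonstration.

```python
# Hypothetical sketch of a checklist-coverage score. The real Sphinx
# benchmark's item-matching logic is not described in this summary;
# substring matching is a stand-in assumption.

def covers(review: str, item: str) -> bool:
    """Naive proxy: an item counts as covered if its key phrase appears."""
    return item.lower() in review.lower()

def checklist_coverage(review: str, checklist: list[str]) -> float:
    """Fraction of checklist verification points addressed by the review."""
    if not checklist:
        return 0.0
    hit = sum(covers(review, item) for item in checklist)
    return hit / len(checklist)

checklist = [
    "null check before dereference",
    "error handling for the network call",
    "unit test for the new branch",
]
review = ("Please add a null check before dereference "
          "and a unit test for the new branch.")
print(checklist_coverage(review, checklist))  # 2 of 3 items covered
```

A coverage score like this rewards completeness directly, unlike surface-level overlap metrics such as BLEU, which the abstract explicitly moves beyond.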

📝 Abstract
Pull request (PR) review is essential for ensuring software quality, yet automating this task remains challenging due to noisy supervision, limited contextual understanding, and inadequate evaluation metrics. We present Sphinx, a unified framework for LLM-based PR review that addresses these limitations through three key components: (1) a structured data generation pipeline that produces context-rich, semantically grounded review comments by comparing pseudo-modified and merged code; (2) a checklist-based evaluation benchmark that assesses review quality based on structured coverage of actionable verification points, moving beyond surface-level metrics like BLEU; and (3) Checklist Reward Policy Optimization (CRPO), a novel training paradigm that uses rule-based, interpretable rewards to align model behavior with real-world review practices. Extensive experiments show that models trained with Sphinx achieve state-of-the-art performance on review completeness and precision, outperforming both proprietary and open-source baselines by up to 40% in checklist coverage. Together, these components enable the development of PR review models that are not only fluent but also context-aware, technically precise, and practically deployable in real-world development workflows. The data will be released after review.
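The abstract states that CRPO optimizes the model with rule-based, interpretable rewards tied to checklist items. The exact reward function is not given in this listing, so the following is only one plausible shape, assuming a reward that adds checklist coverage and subtracts a precision penalty for comments matching no item; `match`, `alpha`, and `beta` are illustrative names, not from the paper.

```python
# Illustrative rule-based reward in the spirit of CRPO (assumed form, not
# the paper's actual reward): reward coverage of checklist items, penalize
# spurious comments that address no item.

def crpo_style_reward(comments, checklist, match, alpha=1.0, beta=0.5):
    """comments: generated review comments; checklist: verification points;
    match(comment, item) -> bool decides whether a comment addresses an item.
    alpha/beta are assumed weights for coverage vs. precision."""
    if checklist:
        covered = sum(
            1 for item in checklist
            if any(match(c, item) for c in comments)
        )
        coverage = covered / len(checklist)
    else:
        coverage = 0.0
    if comments:
        spurious = sum(
            1 for c in comments
            if not any(match(c, item) for item in checklist)
        )
        precision_pen = spurious / len(comments)
    else:
        precision_pen = 0.0
    return alpha * coverage - beta * precision_pen
```

Because every term is a simple rule over checklist matches, the resulting reward is interpretable: a trained model can be told exactly which items it missed and which comments earned the penalty.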
Problem

Research questions and friction points this paper is trying to address.

pull request review
noisy supervision
contextual understanding
evaluation metrics
code review automation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based code review
structured data generation
checklist-based evaluation
reward policy optimization
context-aware modeling
Daoan Zhang (PhD Student, University of Rochester; Computer Vision, Multimodal Learning, LLM)
Shuo Zhang (Microsoft)
Zijian Jin (New York University; NLP)
Jiebo Luo (University of Rochester)
Shengyu Fu (Microsoft)
Elsie Nallipogu (Microsoft)