🤖 AI Summary
Existing paper quality assessment methods suffer from high inference costs (LLMs) or inconsistent rating scales (regression models). This work proposes NAIPv2, a lightweight, debiased ranking framework based on domain- and year-aware pairwise learning. It introduces the Review Tendency Signal (RTS), a supervision signal that jointly encodes reviewer scores and confidences. The method integrates probabilistic signal aggregation with structured metadata modeling and is trained on NAIDv2, a large-scale, self-constructed dataset of ICLR submissions. The resulting model achieves linear-time inference while maintaining strong generalization: on the ICLR test set it attains state-of-the-art performance (78.2% AUC, 0.432 Spearman correlation), and on unseen NeurIPS submissions its predicted scores increase consistently across decision categories, from Rejected to Oral.
📝 Abstract
The ability to estimate the quality of scientific papers is central to how both humans and AI systems will advance scientific knowledge in the future. However, existing LLM-based estimation methods suffer from high inference cost, whereas the faster direct score regression approach is limited by scale inconsistencies. We present NAIPv2, a debiased and efficient framework for paper quality estimation. NAIPv2 employs pairwise learning within domain-year groups to reduce inconsistencies in reviewer ratings and introduces the Review Tendency Signal (RTS) as a probabilistic integration of reviewer scores and confidences. To support training and evaluation, we further construct NAIDv2, a large-scale dataset of 24,276 ICLR submissions enriched with metadata and detailed structured content. Trained on pairwise comparisons but enabling efficient pointwise prediction at deployment, NAIPv2 achieves state-of-the-art performance (78.2% AUC, 0.432 Spearman), while maintaining scalable, linear-time efficiency at inference. Notably, on unseen NeurIPS submissions, it further demonstrates strong generalization, with predicted scores increasing consistently across decision categories from Rejected to Oral. These findings establish NAIPv2 as a debiased and scalable framework for automated paper quality estimation, marking a step toward future scientific intelligence systems. Code and dataset are released at https://sway.cloud.microsoft/Pr42npP80MfPhvj8.
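The abstract's core ideas, a confidence-weighted supervision signal (RTS) and pairwise training that still permits pointwise scoring at deployment, can be illustrated with a minimal sketch. Note this is an assumption-laden illustration, not the paper's actual implementation: the exact RTS formula and loss are not given above, so a confidence-weighted average and a Bradley-Terry-style logistic loss on score differences are used as plausible stand-ins, and the function names are invented for this example.

```python
import numpy as np

def review_tendency_signal(scores, confidences):
    """Hypothetical RTS: confidence-weighted aggregation of reviewer scores.

    Reviewers with higher confidence contribute more to the target signal.
    """
    s = np.asarray(scores, dtype=float)
    w = np.asarray(confidences, dtype=float)
    return float(np.sum(w * s) / np.sum(w))

def pairwise_logistic_loss(f_a, f_b, target):
    """Bradley-Terry-style pairwise loss on the difference of pointwise scores.

    f_a, f_b: scalar model scores for two papers from the same domain-year
    group (pairing within groups reduces cross-venue rating inconsistencies).
    target: 1.0 if paper A has the higher review tendency, else 0.0.
    Because the loss depends only on f_a - f_b, the trained model still emits
    a single scalar per paper, so deployment is pointwise and linear-time.
    """
    p = 1.0 / (1.0 + np.exp(-(f_a - f_b)))
    return float(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))
```

At inference, one forward pass per paper yields a comparable scalar score, so ranking n submissions costs n model evaluations plus a sort, rather than the O(n) LLM calls (each far more expensive) or pairwise O(n^2) comparisons.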