🤖 AI Summary
This work reframes the evaluation of explanation quality as a learning-to-rank problem, moving beyond conventional approaches that rely on generating a single optimal explanation or pointwise regression, which struggle to distinguish among explanations of varying quality levels. The study introduces listwise ranking methods—specifically ListNet, LambdaRank, and RankNet—to train a reward model capable of performing relative quality assessment over multiple candidate explanations while preserving their ordinal structure. Experimental results demonstrate that ranking-based losses consistently outperform regression-based counterparts across all domains. Furthermore, policy optimization using ranking-derived rewards achieves stable convergence, whereas regression-based rewards fail entirely. The findings also highlight that data quality exerts a more decisive influence than model scale, enabling smaller models to match the performance of significantly larger ones.
📝 Abstract
We reformulate explanation quality assessment as a ranking problem rather than a generation problem. Instead of optimizing models to produce a single "best" explanation token-by-token, we train reward models to discriminate among multiple candidate explanations and learn their relative quality. Concretely, we construct per-instance candidate sets with graded quality levels and train listwise and pairwise ranking models (ListNet, LambdaRank, RankNet) to preserve ordinal structure and avoid score compression typical of pointwise regression or binary preference objectives. We observe three findings: First, ranking losses consistently outperform regression on score separation across all domains tested. Second, the optimal ranking loss depends on data characteristics: listwise objectives excel with well-separated quality tiers, while pairwise methods are more robust to noisy natural annotations. Third, when trained on carefully curated and well-structured data, small encoder models can match models that are orders of magnitude larger, suggesting that data quality matters more than model scale. Finally, when used as rewards in policy optimization, ranking-based scores enable stable convergence in settings where regression-based rewards fail entirely. Code and data are available at: https://github.com/Tankiit/PPO_Learning_to_rank