Explanation Quality Assessment as Ranking with Listwise Rewards

📅 2026-04-27
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work reframes explanation quality assessment as a learning-to-rank problem. Conventional approaches either generate a single optimal explanation or fit a pointwise regression, and both struggle to distinguish among explanations of differing quality. The study trains reward models with listwise and pairwise ranking losses (ListNet, LambdaRank, and RankNet) so that they assess the relative quality of multiple candidate explanations while preserving their ordinal structure. In the reported experiments, ranking losses consistently outperform regression on score separation across all domains tested. Policy optimization with ranking-derived rewards converges stably in settings where regression-based rewards fail entirely. The findings also suggest that data quality matters more than model scale: small encoder models trained on carefully curated data can match models that are orders of magnitude larger.


📝 Abstract
We reformulate explanation quality assessment as a ranking problem rather than a generation problem. Instead of optimizing models to produce a single "best" explanation token-by-token, we train reward models to discriminate among multiple candidate explanations and learn their relative quality. Concretely, we construct per-instance candidate sets with graded quality levels and train listwise and pairwise ranking models (ListNet, LambdaRank, RankNet) to preserve ordinal structure and avoid score compression typical of pointwise regression or binary preference objectives. We observe three findings: First, ranking losses consistently outperform regression on score separation across all domains tested. Second, the optimal ranking loss depends on data characteristics: listwise objectives excel with well-separated quality tiers, while pairwise methods are more robust to noisy natural annotations. Third, when trained on carefully curated and well-structured data, small encoder models can match models that are orders of magnitude larger, suggesting that data quality matters more than model scale. Finally, when used as rewards in policy optimization, ranking-based scores enable stable convergence in settings where regression-based rewards fail entirely. Code and data are available at: https://github.com/Tankiit/PPO_Learning_to_rank
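To make the contrast between the listwise and pairwise objectives named in the abstract concrete, here is a minimal sketch of a ListNet-style top-1 loss and a RankNet-style pairwise loss for one candidate set. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of raw graded labels as the ListNet target, and the toy scores are all hypothetical, and LambdaRank is omitted for brevity. The linked repository may do this differently.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def listnet_top1_loss(scores, labels):
    """Listwise: cross-entropy between the top-1 distributions
    induced by the graded labels and by the model scores."""
    target = softmax(labels)
    pred = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, pred))

def ranknet_loss(scores, labels):
    """Pairwise: mean logistic loss over ordered pairs (i, j)
    where candidate i has a strictly higher label than candidate j."""
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                diff = scores[i] - scores[j]
                # Stable form of log(1 + exp(-diff)).
                losses.append(max(-diff, 0.0) + math.log1p(math.exp(-abs(diff))))
    return sum(losses) / len(losses) if losses else 0.0

# Hypothetical candidate set: four explanations with graded quality 3 > 2 > 1 > 0.
labels = [3.0, 2.0, 1.0, 0.0]
good = [2.0, 1.0, 0.0, -1.0]   # scores that respect the ordering
bad = [-1.0, 0.0, 1.0, 2.0]    # scores that invert the ordering

print("ListNet  good vs bad:", listnet_top1_loss(good, labels), listnet_top1_loss(bad, labels))
print("RankNet  good vs bad:", ranknet_loss(good, labels), ranknet_loss(bad, labels))
```

Both losses depend only on score differences within a candidate set, not on absolute score values. A pointwise regression target, by contrast, penalizes each score independently of the others in its set.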
Problem

Research questions and friction points this paper is trying to address.

explanation quality assessment
ranking
reward modeling
listwise learning
ordinal evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

explanation quality assessment
learning to rank
listwise ranking
reward modeling
ordinal structure preservation
Thomas Bailleux
CRIL, Univ. Artois & CNRS, France
Tanmoy Mukherjee
Meghnad Saha Institute of Technology
algorithms
Emmanuel Lonca
CRIL, Univ. Artois & CNRS, France
Pierre Marquis
CRIL, Univ. Artois & CNRS, France; Institut Universitaire de France
Zied Bouraoui
Professor of Computer Science, CRIL CNRS & Artois University
Artificial intelligence