🤖 AI Summary
Current LLM explanation evaluation relies heavily on binary preference judgments, which lack transparency and fine-grained attribution. To address this, we propose an attribute-based fine-grained evaluation framework: first, we identify key explainability attributes of high-quality reasoning, e.g., logical coherence and factual accuracy; second, we integrate automated metrics, LLM-based judgment, and human annotations into a multi-source scoring system; third, we employ SHAP analysis to quantify each attribute’s contribution to human preferences and design attribute-specific Elo scoring for interpretable model comparison. Experiments on MT-Bench and Chatbot Arena demonstrate that our framework significantly improves evaluation transparency and reliability. Attribute scores exhibit strong explanatory power for human preferences (R² > 0.85), enabling precise, attributable model diagnosis and ranking.
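The SHAP step described above can be sketched in miniature: with a small number of attributes and a simple preference model, exact Shapley values can be computed by brute force over attribute coalitions. The attribute names, weights, and baseline values below are hypothetical illustrations, not values from the paper (for a linear model, each attribute's Shapley value reduces to its weight times its deviation from the baseline):

```python
from itertools import combinations
from math import factorial

# Hypothetical attributes and linear preference model (illustrative only).
ATTRIBUTES = ["logical_coherence", "factual_accuracy", "completeness"]
WEIGHTS = {"logical_coherence": 0.9, "factual_accuracy": 0.7, "completeness": 0.3}
BASELINE = {a: 0.5 for a in ATTRIBUTES}  # stand-in for dataset-mean attribute scores

def value(subset, x):
    """Model output with attributes in `subset` at their observed values
    and all other attributes held at the baseline (mean) value."""
    return sum(WEIGHTS[a] * (x[a] if a in subset else BASELINE[a])
               for a in ATTRIBUTES)

def shapley(x):
    """Exact Shapley attribution for each attribute, by enumerating
    all coalitions of the remaining attributes."""
    n = len(ATTRIBUTES)
    phi = {}
    for a in ATTRIBUTES:
        others = [b for b in ATTRIBUTES if b != a]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(set(S) | {a}, x) - value(set(S), x))
        phi[a] = total
    return phi

# One rationale's attribute scores (hypothetical).
x = {"logical_coherence": 0.9, "factual_accuracy": 0.4, "completeness": 0.6}
phi = shapley(x)
```

In practice one would fit the preference model on pairwise judgment data and use the `shap` library for non-linear models; the brute-force version above is only feasible because the attribute set is small.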
📝 Abstract
Large language models (LLMs) often generate natural language rationales: free-form explanations that help improve performance on complex reasoning tasks and enhance interpretability for human users. However, evaluating these rationales remains challenging. While recent work has relied on binary preference judgments from humans or LLM judges, such evaluations are often opaque and coarse-grained, offering limited insight into what makes one rationale better than another. In this work, we rethink preference evaluation for LLM-generated rationales by asking: (1) What attributes define good rationales? (2) Can human preferences be explained by these attributes? (3) Can attribute-based evaluation overcome the limitations of binary comparisons? We identify a set of key rationale attributes from prior literature and assess them using automatic metrics, LLM judgments, and human annotations. We then analyze two standard human preference datasets, MT-Bench and Chatbot Arena, using SHAP to identify which attributes best explain human preference outcomes. Finally, we re-evaluate model-generated rationales using attribute-specific Elo scores, revealing more nuanced model comparisons and insights. Our findings suggest that fine-grained attribute evaluations can better characterize rationale quality and guide future research toward more interpretable and reliable evaluation practices.
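The attribute-specific Elo idea can be sketched as a standard Elo update applied per attribute rather than per overall preference: each model keeps one rating per attribute, and a pairwise judgment on a given attribute updates only that attribute's ratings. The K-factor, starting rating, and model/attribute names below are illustrative assumptions, not details from the paper:

```python
# Per-attribute Elo ledger: ratings[attribute][model] -> rating.
K = 32          # hypothetical update step size
START = 1000.0  # hypothetical starting rating

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_battle(ratings, attribute, model_a, model_b, outcome):
    """Update one attribute's ratings from a single pairwise judgment.
    outcome: 1.0 = A preferred on this attribute, 0.0 = B preferred, 0.5 = tie."""
    table = ratings.setdefault(attribute, {})
    ra = table.setdefault(model_a, START)
    rb = table.setdefault(model_b, START)
    ea = expected_score(ra, rb)
    table[model_a] = ra + K * (outcome - ea)
    table[model_b] = rb + K * ((1.0 - outcome) - (1.0 - ea))

ratings = {}
# A preferred on factual accuracy; tie on logical coherence (hypothetical data).
record_battle(ratings, "factual_accuracy", "model_a", "model_b", 1.0)
record_battle(ratings, "logical_coherence", "model_a", "model_b", 0.5)
```

Because ratings are tracked per attribute, a model can rank highly on factual accuracy while ranking poorly on coherence, which is exactly the kind of nuance a single overall Elo score hides.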