🤖 AI Summary
This work addresses the failure of existing evaluation metrics for long-form question answering (LFQA) to capture fine-grained human judgments of multi-sentence explanatory responses. To this end, the authors construct a large-scale LFQA dataset of 1.3 million human pairwise preference ratings and introduce an answer quality scoring framework spanning nine rubric dimensions. On top of these rubrics, they develop a linear evaluation model built from rule-derived features that matches the performance of state-of-the-art large language model (LLM) evaluators while remaining transparent and interpretable. Further analysis exposes systematic flaws in current LLM-based evaluators, including transitivity violations, positional bias, and sensitivity to verbosity, as well as vulnerability to adversarial perturbations, underscoring the proposed method's reliability and robustness.
📝 Abstract
Long-form question answering (LFQA) demands nuanced evaluation of multi-sentence explanatory responses, yet existing metrics often fail to reflect human judgment. We present LFQA-HP-1M, a large-scale dataset comprising 1.3M human pairwise preference annotations for LFQA. We propose nine rubrics for answer quality evaluation and show that simple linear models over these rubric features perform comparably to state-of-the-art LLM evaluators. We further examine transitivity violations, positional bias, and verbosity bias in LLM evaluators and demonstrate their vulnerability to adversarial perturbations. Overall, this work provides one of the largest public LFQA preference datasets and a rubric-driven framework for transparent and reliable evaluation.
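The abstract's "simple linear models" over rubric features, fit to pairwise preferences, can be sketched as a Bradley-Terry-style logistic regression on feature differences. This is a hedged illustration, not the authors' actual implementation: the features, data, and hyperparameters below are synthetic stand-ins (three toy dimensions instead of the paper's nine rubrics).

```python
# Sketch of a linear pairwise-preference model over rubric features.
# All data here is simulated; only the modeling recipe is illustrated.
import numpy as np

rng = np.random.default_rng(0)

# Toy rubric feature vectors for each side of 500 answer pairs.
n_pairs, n_features = 500, 3
feats_a = rng.normal(size=(n_pairs, n_features))
feats_b = rng.normal(size=(n_pairs, n_features))

# Hidden weights used only to simulate noisy human preference labels.
true_w = np.array([1.5, -0.5, 0.8])
logits = (feats_a - feats_b) @ true_w
labels = (rng.random(n_pairs) < 1 / (1 + np.exp(-logits))).astype(float)

def fit_preference_model(fa, fb, y, lr=0.1, steps=2000):
    """Fit w so that sigmoid((fa - fb) @ w) estimates P(A preferred over B)."""
    x = fa - fb
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(x @ w)))   # predicted preference probability
        grad = x.T @ (p - y) / len(y)    # gradient of the logistic loss
        w -= lr * grad
    return w

w = fit_preference_model(feats_a, feats_b, labels)

# Evaluate against the noise-free preference direction.
pred = ((feats_a - feats_b) @ w) > 0
acc = (pred == (logits > 0)).mean()
print(f"recovered weights: {w}, pairwise accuracy: {acc:.2f}")
```

Because the learned weights attach directly to named rubric dimensions, the model's decisions stay inspectable, which is the transparency advantage the summary highlights over opaque LLM judges.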