🤖 AI Summary
This work addresses the failure of existing evaluation metrics for long-form question answering (LFQA) to capture fine-grained human judgments of multi-sentence explanatory responses. To this end, the authors construct a large-scale LFQA dataset of 1.3 million human pairwise preference ratings and introduce an answer quality scoring framework spanning nine rubric dimensions. On top of these rubrics, they develop a linear evaluation model built from rule-derived features that matches the performance of state-of-the-art large language model (LLM) evaluators while remaining transparent and interpretable. Further analysis exposes systematic flaws in current LLM-based evaluators, including transitivity violations, positional bias, and sensitivity to verbosity, as well as vulnerability to adversarial perturbations, underscoring the proposed method's reliability and robustness.
📝 Abstract
Long-form question answering (LFQA) demands nuanced evaluation of multi-sentence explanatory responses, yet existing metrics often fail to reflect human judgment. We present LFQA-HP-1M, a large-scale dataset comprising 1.3M human pairwise preference annotations for LFQA. We propose nine rubrics for answer quality evaluation and show that simple linear models over these rubric features perform comparably to state-of-the-art LLM evaluators. We further examine transitivity violations, positional bias, and verbosity bias in LLM evaluators and demonstrate their vulnerability to adversarial perturbations. Overall, this work provides one of the largest public LFQA preference datasets and a rubric-driven framework for transparent and reliable evaluation.
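The abstract's "simple linear models" over rubric features, fit to pairwise preferences, can be sketched as a Bradley-Terry-style logistic regression on feature differences. This is a hedged illustration, not the authors' actual implementation: the features, data, and hyperparameters below are synthetic stand-ins (three toy dimensions instead of the paper's nine rubrics).

```python
# Sketch of a linear pairwise-preference model over rubric features.
# All data here is simulated; only the modeling recipe is illustrated.
import numpy as np

rng = np.random.default_rng(0)

# Toy rubric feature vectors for each side of 500 answer pairs.
n_pairs, n_features = 500, 3
feats_a = rng.normal(size=(n_pairs, n_features))
feats_b = rng.normal(size=(n_pairs, n_features))

# Hidden weights used only to simulate noisy human preference labels.
true_w = np.array([1.5, -0.5, 0.8])
logits = (feats_a - feats_b) @ true_w
labels = (rng.random(n_pairs) < 1 / (1 + np.exp(-logits))).astype(float)

def fit_preference_model(fa, fb, y, lr=0.1, steps=2000):
    """Fit w so that sigmoid((fa - fb) @ w) estimates P(A preferred over B)."""
    x = fa - fb
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(x @ w)))   # predicted preference probability
        grad = x.T @ (p - y) / len(y)    # gradient of the logistic loss
        w -= lr * grad
    return w

w = fit_preference_model(feats_a, feats_b, labels)

# Evaluate against the noise-free preference direction.
pred = ((feats_a - feats_b) @ w) > 0
acc = (pred == (logits > 0)).mean()
print(f"recovered weights: {w}, pairwise accuracy: {acc:.2f}")
```

Because the learned weights attach directly to named rubric dimensions, the model's decisions stay inspectable, which is the transparency advantage the summary highlights over opaque LLM judges.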