LFQA-HP-1M: A Large-Scale Human Preference Dataset for Long-Form Question Answering

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing evaluation metrics for long-form question answering (LFQA) fail to capture fine-grained human judgments of multi-sentence explanatory responses. To this end, the authors construct a large-scale LFQA dataset comprising 1.3 million human pairwise preference ratings and introduce a comprehensive answer-quality scoring framework spanning nine dimensions. They develop a linear evaluation model based on rule-derived features that matches the performance of state-of-the-art large language model (LLM) evaluators while offering superior transparency and interpretability. Further analysis reveals systematic flaws in current LLM-based evaluators, including violations of transitivity, positional bias, and sensitivity to verbosity, as well as adversarial vulnerabilities, highlighting the proposed method's greater reliability and robustness.
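The linear evaluator described above can be illustrated with a minimal sketch: score each answer on the nine rubric dimensions, then predict the pairwise preference from a weighted sum of the feature differences via a logistic link. The rubric names, function names, and training loop below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a linear pairwise-preference model over rubric
# features. Rubric names are placeholders, not the paper's actual nine.
import numpy as np

RUBRICS = ["factuality", "completeness", "relevance", "coherence",
           "conciseness", "clarity", "helpfulness", "objectivity", "grounding"]

def preference_score(feats_a, feats_b, weights):
    """P(answer A preferred over answer B) from a logistic link on
    the difference of rubric feature vectors."""
    diff = np.asarray(feats_a, dtype=float) - np.asarray(feats_b, dtype=float)
    return 1.0 / (1.0 + np.exp(-diff @ weights))

def fit_weights(pairs, labels, lr=0.5, epochs=300):
    """Fit the linear weights by batch gradient descent on the
    pairwise logistic loss. `pairs` is a list of (feats_a, feats_b);
    `labels` is 1.0 when A was preferred, else 0.0."""
    w = np.zeros(len(RUBRICS))
    for _ in range(epochs):
        grad = np.zeros_like(w)
        for (fa, fb), y in zip(pairs, labels):
            d = np.asarray(fa, dtype=float) - np.asarray(fb, dtype=float)
            p = 1.0 / (1.0 + np.exp(-d @ w))
            grad += (p - y) * d
        w -= lr * grad / len(pairs)
    return w
```

Because the model scores only the feature *difference*, it is position-symmetric by construction: `preference_score(a, b, w) + preference_score(b, a, w) == 1`, which is one source of the transparency the summary attributes to the linear approach.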

📝 Abstract
Long-form question answering (LFQA) demands nuanced evaluation of multi-sentence explanatory responses, yet existing metrics often fail to reflect human judgment. We present LFQA-HP-1M, a large-scale dataset comprising 1.3M human pairwise preference annotations for LFQA. We propose nine rubrics for answer quality evaluation, and show that simple linear models based on these features perform comparably to state-of-the-art LLM evaluators. We further examine transitivity consistency, positional bias, and verbosity biases in LLM evaluators and demonstrate their vulnerability to adversarial perturbations. Overall, this work provides one of the largest public LFQA preference datasets and a rubric-driven framework for transparent and reliable evaluation.
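One of the consistency properties the abstract examines, transitivity, can be checked mechanically: if an evaluator prefers A over B and B over C, it should also prefer A over C. A minimal sketch of such a check (my own illustration, not the paper's code) is:

```python
# Hypothetical sketch: count transitivity violations in an evaluator's
# pairwise judgments over a set of candidate answers.
from itertools import permutations

def transitivity_violations(prefers):
    """Count ordered triples (a, b, c) where the evaluator prefers
    a over b and b over c, but not a over c.

    `prefers` maps an ordered pair (x, y) to True iff x is preferred
    over y; missing pairs are treated as "not preferred".
    """
    items = {x for pair in prefers for x in pair}
    violations = 0
    for a, b, c in permutations(sorted(items), 3):
        if (prefers.get((a, b)) and prefers.get((b, c))
                and not prefers.get((a, c))):
            violations += 1
    return violations
```

A cyclic set of judgments (A > B, B > C, C > A) yields a nonzero count, while any judgment set consistent with a total order yields zero, which is what a reliable evaluator should produce.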
Problem

Research questions and friction points this paper is trying to address.

Long-Form Question Answering
Human Preference
Evaluation Metrics
Answer Quality
LLM Evaluators
Innovation

Methods, ideas, or system contributions that make the work stand out.

LFQA
human preference dataset
rubric-based evaluation
LLM evaluator bias
adversarial robustness
Rafid Ishrak Jahan
University of North Texas, Denton, Texas, USA
Fahmid Shahriar Iqbal
University of North Texas, Denton, Texas, USA
Sagnik Ray Choudhury
University of North Texas
digital library · NLP · explainability · information extraction