Argument-Based Comparative Question Answering Evaluation Benchmark

📅 2025-02-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenge of automatic evaluation of summary quality in comparative question answering (CQA). Methodologically, it introduces the first argumentation-based, multi-dimensional evaluation framework and benchmark: (1) a dual-track assessment system covering 15 fine-grained criteria, integrating human annotations with outputs from six large language models (including Llama-3-70B-Instruct and GPT-4); and (2) a reproducible consistency analysis pipeline leveraging diverse CQA datasets. Key contributions include: (i) establishing the first argumentation-grounded evaluation paradigm tailored to CQA; (ii) empirical findings showing Llama-3-70B-Instruct achieves superior performance on evaluation tasks, while GPT-4 excels in answer generation; and (iii) full open-sourcing of data, code, and results—providing a standardized evaluation infrastructure for the field.

📝 Abstract
In this paper, we aim to solve the problems standing in the way of automatic comparative question answering. To this end, we propose an evaluation framework to assess the quality of comparative question answering summaries. We formulate 15 criteria for assessing comparative answers, annotated both manually and by 6 large language models on two comparative question answering datasets. We perform our tests using several LLMs and manual annotation under different settings and demonstrate the consistency of both evaluations. Our results show that the Llama-3 70B Instruct model achieves the best results for summary evaluation, while GPT-4 is the best for answering comparative questions. All used data, code, and evaluation results are publicly available: https://anonymous.4open.science/r/cqa-evaluation-benchmark-4561/README.md
Problem

Research questions and friction points this paper is trying to address.

Develops evaluation framework for comparative QA summaries.
Assesses 15 criteria using manual and LLM annotations.
Compares performance of multiple large language models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluation framework for summaries
Criteria from manual and LLM annotation
Publicly available data and code
Irina Nikishina
Postdoc @ University of Hamburg
Natural Language Processing, RAG, Taxonomies, Question Answering
Saba Anwar
University of Hamburg
Nikolay Dolgov
HSE University
Maria Manina
HSE University
Daria Ignatenko
HSE University
Viktor Moskvoretskii
HSE University, Skoltech
Artem Shelmanov
MBZUAI
uncertainty estimation, fairness, active learning, NLP, deep learning
Christian Biemann
University of Hamburg