MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs

πŸ“… 2025-06-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing automatic evaluation methods for open-domain question answering (QA) suffer from four key limitations: weak semantic sensitivity of traditional metrics (e.g., ROUGE, BERTScore); poor interpretability of LLM-based evaluators; pointwise scoring that lacks question-specific adaptation; and failure to distinguish between factoid and non-factoid questions. To address these, we propose the first question-type-aware dual-path evaluation paradigm: for factoid questions, we design a keypoint-driven adaptive scoring mechanism; for non-factoid questions, we introduce a context-aware, instance-conditioned listwise ranking approach. Our method integrates fine-grained semantic parsing, dynamic keypoint extraction, and multi-dimensional human-alignment optimization. Evaluated across multiple open-domain QA benchmarks, our approach achieves an average 19.3% improvement in Kendall’s τ over ROUGE, BERTScore, and state-of-the-art LLM-based evaluators, while boosting interpretability scores by 41%.
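The summary reports alignment with human rankings via Kendall’s τ. As a reminder of what that metric measures, here is a minimal self-contained sketch of τ-a (concordant minus discordant pairs over all pairs); the function name and tie handling are illustrative, not taken from the paper:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items.

    rank_a, rank_b: lists where rank_x[i] is the rank assigned to item i.
    Returns 1.0 for identical orderings, -1.0 for fully reversed ones.
    """
    assert len(rank_a) == len(rank_b)
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0 (perfect agreement)
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0 (full reversal)
```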

πŸ“ Abstract
Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs). Compared to closed-ended QA, it demands longer answer statements, more nuanced reasoning processes, and diverse expressions, making refined and interpretable automatic evaluation both crucial and challenging. Traditional metrics like ROUGE and BERTScore struggle to capture semantic similarities due to different patterns between model responses and reference answers. Current LLM-based evaluation approaches, such as pairwise or listwise comparisons of candidate answers, lack intuitive interpretability. While pointwise scoring of each response provides some descriptive feedback, it fails to adapt across different question contents. Most notably, existing methods overlook the distinction between factoid and non-factoid questions. To address these challenges, we propose MinosEval, a novel evaluation method that first distinguishes open-ended questions and then ranks candidate answers using different evaluation strategies. For factoid questions, it applies an adaptive key-point scoring strategy, while for non-factoid questions, it uses an instance-aware listwise ranking strategy. Experiments on multiple open-ended QA datasets, including self-built ones with more candidate responses to complement community resources, show that MinosEval better aligns with human annotations and offers more interpretable results.
Problem

Research questions and friction points this paper is trying to address.

Distinguishing factoid and non-factoid open-ended QA evaluation
Improving semantic similarity capture in automatic evaluation
Enhancing interpretability of LLM-based evaluation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distinguishes factoid and non-factoid questions
Uses adaptive key-point scoring for factoids
Applies instance-aware ranking for non-factoids
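The dual-path idea above can be sketched as a dispatch function. This is a toy illustration only: the paper extracts key points and performs listwise ranking with an LLM, whereas the stand-ins below (substring matching for key-point coverage, answer length as a listwise signal) are hypothetical placeholders, as are all names:

```python
def keypoint_score(answer, keypoints):
    """Toy proxy for the factoid path: fraction of reference key points
    covered by the candidate answer. (MinosEval extracts and scores key
    points with an LLM; case-insensitive substring matching stands in here.)"""
    if not keypoints:
        return 0.0
    hits = sum(1 for kp in keypoints if kp.lower() in answer.lower())
    return hits / len(keypoints)

def rank_candidates(question, candidates, is_factoid, keypoints=None):
    """Dispatch by question type: factoid questions get key-point scoring,
    non-factoid questions get a listwise signal (answer length here, purely
    as a placeholder for an instance-aware LLM ranking)."""
    if is_factoid:
        scored = [(keypoint_score(c, keypoints or []), c) for c in candidates]
    else:
        scored = [(len(c), c) for c in candidates]
    return [c for _, c in sorted(scored, key=lambda t: t[0], reverse=True)]

ranked = rank_candidates(
    "What is the capital of France?",
    ["I think it is Lyon.", "Paris is the capital of France."],
    is_factoid=True,
    keypoints=["Paris"],
)
print(ranked[0])  # Paris is the capital of France.
```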
Yongqi Fan
East China University of Science and Technology
LLM, AI Search, Medical NLP, IR, Agentic RL
Yating Wang
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
Guandong Wang
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
Jie Zhai
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
Jingping Liu
ECUST
large language model, knowledge graph
Qi Ye
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
Tong Ruan
East China University of Science and Technology
Clinical NLP, LLM, KG