LogicScore: Fine-grained Logic Evaluation of Conciseness, Completeness, and Determinateness in Attributed Question Answering

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses "attribution myopia" in current attribution-based evaluation methods for question answering: these methods overemphasize local factual matching while neglecting the global logical coherence of long-form answers, which often leads large language models to produce factually correct but logically inconsistent responses. To remedy this, the authors propose LogicScore, a framework that introduces Horn rules into attribution evaluation and employs a backward verification mechanism to analyze multi-hop reasoning chains along three dimensions: completeness, conciseness, and determinateness. The approach moves beyond the conventional locality-centered paradigm by establishing an evaluation standard that jointly accounts for factual accuracy and logical consistency. Extensive experiments on the HotpotQA, MuSiQue, and 2WikiMultihopQA benchmarks with over 20 large language models reveal that, despite high attribution accuracy (e.g., 92.85% precision for Gemini-3 Pro), mainstream models exhibit substantially deficient logical quality, with conciseness scores as low as 35.11%.

📝 Abstract
Current evaluation methods for Attributed Question Answering (AQA) suffer from attribution myopia: they emphasize verification of isolated statements and their attributions but overlook the global logical integrity of long-form answers. Consequently, Large Language Models (LLMs) often produce factually grounded yet logically incoherent responses with elusive deductive gaps. To mitigate this limitation, we present LogicScore, a unified evaluation framework that shifts the paradigm from local assessment to global reasoning scrutiny. Grounded in Horn rules, our approach integrates a backward verification mechanism to systematically evaluate three key reasoning dimensions: Completeness (logically sound deduction), Conciseness (non-redundancy), and Determinateness (consistent answer entailment). Extensive experiments across three multi-hop QA datasets (HotpotQA, MuSiQue, and 2WikiMultihopQA) and over 20 LLMs (including GPT-5, Gemini-3-Pro, LLaMA3, and task-specific tuned models) reveal a critical capability gap: leading models often achieve high attribution scores (e.g., 92.85% precision for Gemini-3 Pro) but struggle with global reasoning quality (e.g., 35.11% Conciseness for Gemini-3 Pro). Our work establishes a robust standard for logical evaluation, highlighting the need to prioritize reasoning coherence alongside factual grounding in LLM development. Code is available at: https://github.com/zhichaoyan11/LogicScore.
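
To make the three dimensions concrete, below is a minimal, hypothetical Python sketch of how a Horn-rule-style reasoning chain could be scored for Completeness, Conciseness, and Determinateness. The HornStep structure, the set-based backward check, and the ratio-style formulas are illustrative assumptions, not the paper's actual procedure; the authors' implementation is in the repository linked above.

# Hypothetical sketch (not the paper's implementation): scoring one answer's
# reasoning chain along the three LogicScore dimensions. The HornStep structure,
# the set-based backward check, and the ratio-style formulas are assumptions.

from dataclasses import dataclass

@dataclass(frozen=True)
class HornStep:
    premises: frozenset    # facts this deduction step requires
    conclusion: str        # fact this step derives

def logic_scores(steps, cited_facts, final_answer):
    """Score a chain of Horn-rule-style steps against the cited evidence."""
    derived = set(cited_facts)
    sound_steps = 0
    used_facts = set()

    # Backward-style verification: a step counts as sound only if all of its
    # premises are already supported by cited evidence or earlier conclusions.
    for step in steps:
        if step.premises <= derived:
            sound_steps += 1
            derived.add(step.conclusion)
            used_facts |= set(step.premises)

    completeness = sound_steps / len(steps) if steps else 0.0
    # Conciseness: share of cited facts that actually feed some sound step.
    conciseness = (len(used_facts & set(cited_facts)) / len(cited_facts)
                   if cited_facts else 0.0)
    # Determinateness: does the verified chain entail the stated final answer?
    determinateness = 1.0 if final_answer in derived else 0.0

    return {"completeness": completeness,
            "conciseness": conciseness,
            "determinateness": determinateness}

# Example (hypothetical facts): one sound deduction step, one unused citation.
scores = logic_scores(
    steps=[HornStep(frozenset({"A directed F", "F stars B"}), "A worked with B")],
    cited_facts={"A directed F", "F stars B", "F won an award"},
    final_answer="A worked with B",
)
# -> completeness 1.0, conciseness ~0.67, determinateness 1.0

Under these assumptions, an answer can attain perfect attribution (every cited fact is correct) yet still score poorly on conciseness or determinateness, which mirrors the capability gap the paper reports.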
Problem

Research questions and friction points this paper is trying to address.

Attributed Question Answering
attribution myopia
logical coherence
reasoning evaluation
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

LogicScore
Attributed Question Answering
Logical Evaluation
Horn Rules
Global Reasoning
Zhichao Yan
Shanxi University, Taiyuan, China
Yunxiao Zhao
Shanxi University, Taiyuan, China
Jiapu Wang
Nanjing University of Science and Technology, Nanjing, China
Jiaoyan Chen
Department of Computer Science, University of Manchester
Knowledge Graph, Ontology, Machine Learning, Large Language Model
Shaoru Guo
Shanxi University, Taiyuan, China
Xiaoli Li
Singapore University of Technology and Design, Singapore
Ru Li
Harbin Institute of Technology
Jeff Z. Pan
Professor of Knowledge Computing, University of Edinburgh
Artificial Intelligence, Knowledge Representation and Reasoning, Knowledge Based Learning