Inferential Question Answering

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing question answering systems struggle with questions whose answers are not explicitly stated but must be inferred from contextual clues. It formally defines and systematically investigates the task of inferential question answering, introducing QUIT, a high-quality dataset comprising 7,401 questions and 2.4 million passages annotated with three-level relevance labels through a human-in-the-loop curation process. The proposed approach combines LLM-based answerability scoring, human verification, a retrieval and reranking pipeline, and fine-tuning of reasoning-oriented readers. Experimental results show that conventional QA methods perform poorly on this task: retrievers underperform, reranking yields limited gains, and fine-tuning remains unstable. Even specialized reasoning models fail to outperform smaller general-purpose models, highlighting a significant gap in current systems' capacity for complex inference.
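The curation process described above combines an LLM answerability score with human verification to assign one of three relevance levels. A minimal sketch of such a labeling rule is shown below; the threshold values (0.8, 0.4) and label names are illustrative assumptions, not the paper's actual criteria.

```python
def relevance_label(llm_answerability: float, human_verified: bool) -> str:
    """Map an LLM-based answerability score plus a human verification flag
    to one of three relevance levels. Thresholds and label names here are
    hypothetical stand-ins for the paper's labeling scheme."""
    if llm_answerability >= 0.8 and human_verified:
        # Strong clue content, confirmed by an annotator.
        return "answer-supporting"
    if llm_answerability >= 0.4:
        # Some clue content, but unverified or weaker.
        return "partially-relevant"
    return "irrelevant"
```

In a real pipeline, passages scored high by the LLM would be routed to annotators, so the `human_verified` flag is only available for that subset.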

📝 Abstract
Despite extensive research on a wide range of question answering (QA) systems, most existing work focuses on answer containment, i.e., it assumes that answers can be directly extracted or generated from documents in the corpus. However, some questions require inference: deriving answers that are not explicitly stated but can be inferred from the available information. We introduce Inferential QA, a new task that challenges models to infer answers from answer-supporting passages that provide only clues. To study this problem, we construct the QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages built from high-convergence human- and machine-authored hints, labeled across three relevance levels using LLM-based answerability scoring and human verification. Through a comprehensive evaluation of retrievers, rerankers, and LLM-based readers, we show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements. Even reasoning-oriented LLMs fail to outperform smaller general-purpose models. These findings reveal that current QA pipelines are not yet ready for inference-based reasoning. Inferential QA thus establishes a new class of QA tasks that moves towards understanding and reasoning from indirect textual evidence.
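The evaluation covers a standard two-stage retrieve-then-rerank pipeline feeding an LLM reader. A minimal sketch of that pipeline shape is below; the term-overlap retriever and the pluggable `cross_score` function are simplified stand-ins (for, e.g., BM25 or a dense retriever, and a cross-encoder reranker), not the systems evaluated in the paper.

```python
from collections import Counter

def retrieve(query: str, passages: list[str], k: int = 3) -> list[str]:
    """First-stage retrieval: rank passages by bag-of-words term overlap
    with the query (a toy stand-in for BM25 or a dense retriever)."""
    q = Counter(query.lower().split())
    def overlap(passage: str) -> int:
        return sum(min(q[w], c) for w, c in Counter(passage.lower().split()).items())
    return sorted(passages, key=overlap, reverse=True)[:k]

def rerank(query: str, candidates: list[str], cross_score) -> list[str]:
    """Second-stage reranking: reorder the candidate pool with a stronger
    (hypothetical) scorer, e.g. a cross-encoder taking (query, passage)."""
    return sorted(candidates, key=lambda p: cross_score(query, p), reverse=True)
```

The paper's finding is that for inferential questions this pipeline degrades at both stages: answer-supporting passages share little surface vocabulary with the question, so first-stage scores are weak and reranking has little signal to recover.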
Problem

Research questions and friction points this paper is trying to address.

Inferential QA
question answering
textual inference
reasoning
answerability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inferential QA
reasoning
QUIT dataset
answer inference
indirect evidence