QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?

📅 2025-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation of large language models (LLMs): identifying the minimally necessary clarification question for underspecified reasoning tasks. To this end, we introduce QuestBench—the first benchmark dedicated to single-question information acquisition—covering four task categories: logic (Logic-Q), planning (Planning-Q), grade-school math word problems (GSM-Q), and their equation-form counterparts (GSME-Q). Methodologically, we formalize clarification question generation as a constraint satisfaction problem (CSP) with missing variable assignments, propose a quantifiable single-question clarification paradigm, and design a multiple-choice evaluation protocol. Experimental results reveal that state-of-the-art LLMs achieve over 90% accuracy on the mathematical tasks but only 40–50% on logic and planning tasks. Crucially, problem-solving competence is largely decoupled from question-asking ability, and models exhibit a systematic bias against selecting "uncertain" options. These findings expose a fundamental deficiency in current LLMs' information acquisition mechanisms—particularly their inability to reliably elicit missing premises through targeted, minimal queries.

📝 Abstract
Recently, a large amount of work has focused on improving large language models' (LLMs') performance on reasoning benchmarks such as math and logic. However, past work has largely assumed that tasks are well-defined. In the real world, queries to LLMs are often underspecified, only solvable through acquiring missing information. We formalize this as a constraint satisfaction problem (CSP) with missing variable assignments. Using a special case of this formalism where only one necessary variable assignment is missing, we can rigorously evaluate an LLM's ability to identify the minimal necessary question to ask and quantify axes of difficulty levels for each problem. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: Logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with initial states that are partially observed, (3) GSM-Q: Human-annotated grade school math problems with one missing variable assignment, and (4) GSME-Q: a version of GSM-Q where word problems are translated into equations by human annotators. The LLM is tasked with selecting the correct clarification question(s) from a list of options. While state-of-the-art models excel at GSM-Q and GSME-Q, their accuracy is only 40–50% on Logic-Q and Planning-Q. Analysis demonstrates that the ability to solve well-specified reasoning problems may not be sufficient for success on our benchmark: models have difficulty identifying the right question to ask, even when they can solve the fully specified version of the problem. Furthermore, in the Planning-Q domain, LLMs tend not to hedge, even when explicitly presented with the option to predict "not sure." This highlights the need for deeper investigation into models' information acquisition capabilities.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to identify minimal necessary questions in underspecified reasoning tasks
Assessing LLM performance on constraint satisfaction problems with missing variable assignments
Investigating LLMs' difficulty in asking correct clarification questions despite solving well-specified versions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formalizes underspecified queries as CSP with missing variables
Introduces QuestBench for evaluating LLMs' question-asking ability
Tests models on logic, planning, and math tasks with deliberately missing information