Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models

📅 2025-05-16
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses two critical bottlenecks in Critical Questions Generation (CQs-Gen): the scarcity of high-quality annotated data and the lack of reliable automatic evaluation metrics. To tackle these, we propose a systematic solution: (1) We construct the first large-scale, human-annotated CQs dataset, covering multi-domain argumentative texts and explicitly annotating implicit assumptions. (2) We design a zero-shot, reference-based automatic evaluation method leveraging large language models (LLMs), which significantly outperforms traditional metrics (e.g., BLEU, ROUGE) on relevance and diversity, and achieves strong correlation with human judgments (Spearman's ρ > 0.85). (3) We release an open-source benchmark platform with a real-time leaderboard. Zero-shot evaluation across 11 mainstream LLMs reveals their pervasive deficiency in generating logically challenging questions. All code, data, and evaluation frameworks are publicly available, fostering sustainable advancement in automated critical thinking research.

πŸ“ Abstract
The task of Critical Questions Generation (CQs-Gen) aims to foster critical thinking by enabling systems to generate questions that expose assumptions and challenge the reasoning in arguments. Despite growing interest in this area, progress has been hindered by the lack of suitable datasets and automatic evaluation standards. This work presents a comprehensive approach to support the development and benchmarking of systems for this task. We construct the first large-scale manually-annotated dataset. We also investigate automatic evaluation methods and identify a reference-based technique using large language models (LLMs) as the strategy that best correlates with human judgments. Our zero-shot evaluation of 11 LLMs establishes a strong baseline while showcasing the difficulty of the task. Data, code, and a public leaderboard are provided to encourage further research not only in terms of model performance, but also to explore the practical benefits of CQs-Gen for both automated reasoning and human critical thinking.
Problem

Research questions and friction points this paper is trying to address.

Lack of datasets for Critical Questions Generation (CQs-Gen) task
Absence of automatic evaluation standards for CQs-Gen
Difficulty in generating critical questions that genuinely challenge the reasoning in arguments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructing the first large-scale, manually annotated dataset for CQs-Gen
Using LLMs for reference-based automatic evaluation
Establishing a zero-shot evaluation baseline across 11 LLMs