🤖 AI Summary
This study investigates curiosity-driven natural questioning behavior in real-world settings, with a focus on causal questions embedded in complex, open-ended inquiries. To this end, we construct NatQuest, a large-scale, multi-source dataset of 13,500 authentic human questions drawn from web search, interpersonal conversations, and human-AI interactions. We propose the first LLM-based iterative prompt optimization framework for high-precision causal question identification (up to 42% of all questions), systematically characterizing their linguistic triggers (e.g., "why", "how does X cause Y"), cognitive complexity, and source-specific distribution patterns. Our contributions include: (1) releasing the first large-scale, multi-source natural question dataset; (2) developing six lightweight supervised classification models for causal question detection; and (3) open-sourcing an extensible causal question identification toolkit, providing foundational resources for curiosity modeling and for evaluating the causal reasoning capabilities of large language models.
📝 Abstract
The recent development of Large Language Models (LLMs) has changed how we interact with them. Instead of primarily testing these models with questions whose answers we already know, we now use them to explore questions whose answers are unknown to us. This shift, which has not been fully addressed in existing datasets, highlights the growing need to understand naturally occurring human questions, which are more complex, open-ended, and reflective of real-world needs. To this end, we present NatQuest, a collection of 13,500 naturally occurring questions from three diverse sources: human-to-search-engine queries, human-to-human interactions, and human-to-LLM conversations. This comprehensive collection enables a rich understanding of human curiosity across various domains and contexts. Our analysis reveals a significant presence of causal questions (up to 42%) within the dataset, for which we develop an iterative prompt improvement framework to identify causal queries and examine their unique linguistic properties, cognitive complexity, and source distribution. We also lay the groundwork for exploring LLM performance on these questions and provide six efficient classification models to identify causal questions at scale for future work.
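To make the notion of "linguistic triggers" concrete, here is a minimal, hypothetical sketch of a rule-based causal-question detector built on triggers of the kind mentioned above (e.g., "why", "how does X cause Y"). The trigger patterns below are illustrative assumptions, not the paper's actual lexicon or one of its six classifiers.

```python
import re

# Illustrative causal-question trigger patterns (assumed, not from NatQuest)
CAUSAL_TRIGGERS = [
    r"\bwhy\b",                                                  # "Why is the sky blue?"
    r"\bhow (does|do|did|can|could)\b.*\b(cause|lead to|result in|affect)\b",
    r"\bwhat (causes|caused|makes|made)\b",                      # "What causes inflation?"
]

def is_causal(question: str) -> bool:
    """Return True if the question matches any causal trigger pattern."""
    q = question.lower()
    return any(re.search(pattern, q) for pattern in CAUSAL_TRIGGERS)
```

Such a heuristic baseline is cheap to run at scale but misses implicitly causal phrasings, which is why the paper's prompt-optimization framework and supervised classifiers are needed for high-precision identification.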