"How Do I ...?": Procedural Questions Predominate Student-LLM Chatbot Conversations

📅 2026-02-20

🤖 AI Summary

This study examines the types of questions students ask when interacting with educational large language model (LLM) chatbots, focusing on how impasse-driven questions shape the pedagogical effectiveness of the chatbot's responses. Using 6,113 student messages from two learning contexts, formative self-study and summative assessed coursework, it compares 11 LLMs and three human raters as classifiers of student questions under four existing classification schemas. Procedural questions predominate in both contexts, and more so when students prepare for summative assessment. The LLM raters show moderate-to-good inter-rater reliability, with higher consistency than the human raters. However, the existing schemas struggle to accommodate the semantic richness of composite prompts, so the authors recommend analysis approaches that capture the multi-turn nature of conversation, such as conversation analysis.

📝 Abstract

Providing scaffolding through educational chatbots built on Large Language Models (LLMs) has potential risks and benefits that remain an open area of research. When students navigate impasses, they ask for help by formulating impasse-driven questions. Within interactions with LLM chatbots, such questions shape the user prompts and drive the pedagogical effectiveness of the chatbot's response. This paper focuses on such student questions from two datasets of distinct learning contexts: formative self-study, and summative assessed coursework. We analysed 6,113 messages from both learning contexts, using 11 different LLMs and three human raters to classify student questions using four existing schemas. On the feasibility of using LLMs as raters, results showed moderate-to-good inter-rater reliability, with higher consistency than human raters. The data showed that 'procedural' questions predominated in both learning contexts, but more so when students prepare for summative assessment. These results provide a basis on which to use LLMs for classification of student questions. However, we identify clear limitations in both the ability to classify with schemas and the value of doing so: schemas are limited and thus struggle to accommodate the semantic richness of composite prompts, offering only a partial understanding of the wider risks and benefits of chatbot integration. In the future, we recommend an analysis approach that captures the nuanced, multi-turn nature of conversation, for example, by applying methods from conversation analysis in discursive psychology.
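
The abstract reports inter-rater reliability but does not name the statistic used. As an illustration only, the sketch below computes Fleiss' kappa, one common agreement measure when more than two raters label the same items. The choice of Fleiss' kappa, the function name `fleiss_kappa`, the toy data, and the 'conceptual' label are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Hypothetical helper for illustration; every item must be labelled by
    the same number of raters (at least two).
    """
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("ratings must contain at least one item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Per-item count of how many raters chose each category.
    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}

    # Share of all assignments that went to each category.
    p_j = [
        sum(c[cat] for c in counts) / (n_items * n_raters)
        for cat in categories
    ]

    # Observed agreement per item: fraction of rater pairs that agree.
    p_i = [
        (sum(v * v for v in c.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]

    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance

    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy data (made up): 4 student messages, each labelled by 3 raters.
toy_ratings = [
    ["procedural", "procedural", "procedural"],
    ["procedural", "procedural", "conceptual"],
    ["conceptual", "conceptual", "conceptual"],
    ["procedural", "conceptual", "conceptual"],
]

print(round(fleiss_kappa(toy_ratings), 3))  # prints 0.333
```

The same function could be applied separately to the LLM raters and to the human raters to compare the consistency of each group, which is the kind of comparison the abstract describes. A pairwise statistic such as Cohen's kappa would be an alternative when comparing exactly two raters.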

Problem

Research questions and friction points this paper is trying to address.

procedural questions

LLM chatbots

question classification

educational scaffolding

conversation analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models

student question classification

inter-rater reliability