Synthetic Cognitive Walkthrough: Aligning Large Language Model Performance with Human Cognitive Walkthrough

📅 2025-12-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the feasibility of leveraging large language models (LLMs) to automate cognitive walkthroughs (CW), thereby reducing the cost and effort of usability testing, while evaluating their capacity to simulate human interaction behavior. We propose a multi-round vision-language collaborative prompting framework that jointly interprets UI screenshots and performs task-oriented reasoning to guide GPT-4 and Gemini-2.5-Pro in simulating user navigation paths. A failure-point alignment optimization strategy is introduced to enhance consistency between model-predicted and human-identified usability defects, achieving an F1 score of 0.72. Experiments demonstrate that the models surpass human participants in task completion rate and path optimality, and successfully reproduce 83% of critical usability issues. To our knowledge, this is the first systematic validation of multimodal LLMs for cognitive walkthroughs, establishing a scalable and interpretable paradigm for AI-driven automated usability evaluation.

Technology Category

Application Category

📝 Abstract
Conducting usability testing like cognitive walkthrough (CW) can be costly. Recent developments in large language models (LLMs), with visual reasoning and UI navigation capabilities, present opportunities to automate CW. We explored whether LLMs (GPT-4 and Gemini-2.5-pro) can simulate human behavior in CW by comparing their walkthroughs with human participants. While LLMs could navigate interfaces and provide reasonable rationales, their behavior differed from humans. LLM-prompted CW achieved higher task completion rates than humans and followed more optimal navigation paths, while identifying fewer potential failure points. However, follow-up studies demonstrated that with additional prompting, LLMs can predict human-identified failure points, aligning their performance with human participants. Our work highlights that while LLMs may not replicate human behaviors exactly, they can be leveraged for scaling usability walkthroughs and providing UI insights, offering a valuable complement to traditional usability testing.
Problem

Research questions and friction points this paper is trying to address.

Automating cognitive walkthroughs using large language models
Comparing LLM-simulated usability testing with human performance
Enhancing UI insights by aligning LLM predictions with human findings
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs automate cognitive walkthroughs for usability testing
LLMs achieve higher task completion with optimal navigation paths
Enhanced prompting aligns LLM predictions with human failure points