🤖 AI Summary
This study investigates the feasibility of leveraging large language models (LLMs) to automate cognitive walkthroughs (CW), thereby reducing the cost and effort of usability testing, while evaluating their capacity to simulate human interaction behavior. We propose a multi-round vision-language collaborative prompting framework that jointly interprets UI screenshots and performs task-oriented reasoning to guide GPT-4 and Gemini-2.5-Pro in simulating user navigation paths. A failure-point alignment optimization strategy is introduced to enhance consistency between model-predicted and human-identified usability defects, achieving an F1 score of 0.72. Experiments demonstrate that the models surpass human participants in task completion rate and path optimality, and successfully reproduce 83% of critical usability issues. To our knowledge, this is the first systematic validation of multimodal LLMs for cognitive walkthroughs, establishing a scalable and interpretable paradigm for AI-driven automated usability evaluation.
📝 Abstract
Conducting usability testing methods such as the cognitive walkthrough (CW) can be costly. Recent developments in large language models (LLMs), with visual reasoning and UI navigation capabilities, present opportunities to automate CW. We explored whether LLMs (GPT-4 and Gemini-2.5-Pro) can simulate human behavior in CW by comparing their walkthroughs with those of human participants. While the LLMs could navigate interfaces and provide reasonable rationales, their behavior differed from that of humans: LLM-prompted CW achieved higher task completion rates than humans and followed more optimal navigation paths, but identified fewer potential failure points. However, follow-up studies demonstrated that with additional prompting, LLMs can predict human-identified failure points, aligning their performance with that of human participants. Our work highlights that while LLMs may not replicate human behaviors exactly, they can be leveraged to scale usability walkthroughs and provide UI insights, offering a valuable complement to traditional usability testing.
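The simulated walkthrough described above can be sketched as a simple step loop: at each screen, a model chooses the next UI action given the task, and dead ends are recorded as potential failure points. This is a minimal, self-contained illustration, not the paper's implementation; the UI graph, `llm_choose_action` heuristic, and all names are hypothetical stand-ins (a real system would send a screenshot and task prompt to GPT-4 or Gemini).

```python
# Toy UI graph: screen -> {action label: next screen}.
# Hypothetical example data, not from the paper.
UI_GRAPH = {
    "home": {"Settings": "settings", "Profile": "profile"},
    "settings": {"Notifications": "notifications", "Back": "home"},
    "notifications": {},
    "profile": {"Back": "home"},
}

def llm_choose_action(screen, actions, task):
    """Placeholder for a vision-language model call: given the current
    screen (in practice, a screenshot) and the task, pick an action.
    Here a trivial heuristic picks the action named in the task text."""
    for action in actions:
        if action.lower() in task.lower():
            return action
    return actions[0] if actions else None

def run_walkthrough(task, goal_screen, start="home", max_steps=10):
    """Simulate one CW: step through the UI until the goal is reached
    or the step budget runs out; record dead ends as failure points."""
    screen, path, failure_points = start, [start], []
    for _ in range(max_steps):
        if screen == goal_screen:
            return path, failure_points, True
        actions = list(UI_GRAPH[screen])
        choice = llm_choose_action(screen, actions, task)
        if choice is None:  # no viable action: a potential failure point
            failure_points.append(screen)
            return path, failure_points, False
        screen = UI_GRAPH[screen][choice]
        path.append(screen)
    failure_points.append(screen)  # step budget exhausted
    return path, failure_points, False

path, failures, done = run_walkthrough(
    "Turn off notifications in Settings", goal_screen="notifications")
print(path, done)  # -> ['home', 'settings', 'notifications'] True
```

Comparing the model's `path` and `failure_points` against human walkthrough records is then a matter of path-overlap and set-agreement metrics (e.g. precision/recall over flagged screens).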