Synthetic Heuristic Evaluation: A Comparison between AI- and Human-Powered Usability Evaluation

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Usability evaluation methods such as heuristic evaluation depend on scarce expert time, making them costly to run at scale. This paper proposes synthetic heuristic evaluation: a method that uses multimodal large language models (MLLMs) to analyze interface screenshots against Nielsen's ten usability heuristics and generate design feedback, and systematically compares its output with that of experienced human evaluators. On two real-world applications, the synthetic evaluation identified 73% and 77% of known usability issues, exceeding five experienced UX practitioners (57% and 63%). It also maintained more consistent performance across tasks and was particularly strong at detecting layout issues, but it struggled to recognize some UI components and design conventions and to identify violations that span multiple screens. Repeated runs over time and across accounts showed stable performance, suggesting a lower-cost, scalable complement to human-powered usability evaluation.
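
The page does not reproduce the paper's pipeline details, so the following is only a minimal sketch of the core idea: send an interface screenshot together with Nielsen's ten heuristics to a vision-capable LLM and ask it to list violations. It assumes the OpenAI Python SDK and a gpt-4o-class model; the authors' actual prompts, model choice, and post-processing are not specified here.

```python
# Illustrative sketch only; the paper does not publish its exact prompts,
# model, or pipeline. Assumes the OpenAI Python SDK and a vision-capable model.
import base64
from openai import OpenAI

NIELSEN_HEURISTICS = [
    "Visibility of system status",
    "Match between system and the real world",
    "User control and freedom",
    "Consistency and standards",
    "Error prevention",
    "Recognition rather than recall",
    "Flexibility and efficiency of use",
    "Aesthetic and minimalist design",
    "Help users recognize, diagnose, and recover from errors",
    "Help and documentation",
]

def evaluate_screenshot(path: str, task: str) -> str:
    """Ask a multimodal LLM to flag heuristic violations in one screenshot."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    prompt = (
        f"You are a UX expert performing a heuristic evaluation. Task context: {task}.\n"
        "For the attached app screenshot, list every violation of Nielsen's ten "
        "usability heuristics. For each violation give the heuristic, the UI "
        "element involved, and a severity rating from 0 (none) to 4 (catastrophic).\n"
        "Heuristics:\n"
        + "\n".join(f"{i + 1}. {h}" for i, h in enumerate(NIELSEN_HEURISTICS))
    )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; the paper's choice is not given here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

In practice such a loop would run once per screenshot and task, with the returned violations aggregated and de-duplicated before comparison against human findings.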

📝 Abstract
Usability evaluation is crucial in human-centered design but can be costly, requiring expert time and user compensation. In this work, we developed a method for synthetic heuristic evaluation using multimodal LLMs' ability to analyze images and provide design feedback. Comparing our synthetic evaluations to those by experienced UX practitioners across two apps, we found our evaluation identified 73% and 77% of usability issues, which exceeded the performance of 5 experienced human evaluators (57% and 63%). Compared to human evaluators, the synthetic evaluation maintained consistent performance across tasks and excelled in detecting layout issues, highlighting potential attentional and perceptual strengths of synthetic evaluation. However, synthetic evaluation struggled with recognizing some UI components and design conventions, as well as identifying across-screen violations. Additionally, testing synthetic evaluations over time and across accounts revealed stable performance. Overall, our work highlights the performance differences between human and LLM-driven evaluations, informing the design of synthetic heuristic evaluations.
Problem

Research questions and friction points this paper is trying to address.

Comparing AI and human usability evaluation performance
Developing synthetic heuristic evaluation using multimodal LLMs
Identifying strengths and weaknesses of AI in usability assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal LLMs analyze interface screenshots and return design feedback
Synthetic evaluation detects 73% and 77% of known usability issues across two apps (see the sketch below)
Stable performance across tasks, over time, and across accounts
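
The headline numbers are recall-style detection rates: the fraction of a known usability-issue set that an evaluation recovered. A minimal sketch of that computation, using hypothetical issue counts (the page reports rates, not the underlying counts):

```python
def detection_rate(found: set[str], known: set[str]) -> float:
    """Fraction of known usability issues that an evaluation recovered."""
    return len(found & known) / len(known)

# Hypothetical issue IDs for illustration only; not the paper's data.
known_issues = {f"issue-{i}" for i in range(1, 101)}      # 100 known issues
synthetic_found = {f"issue-{i}" for i in range(1, 74)}    # 73 recovered
human_found = {f"issue-{i}" for i in range(1, 58)}        # 57 recovered

print(f"synthetic: {detection_rate(synthetic_found, known_issues):.0%}")  # 73%
print(f"human:     {detection_rate(human_found, known_issues):.0%}")      # 57%
```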