🤖 AI Summary
This work addresses the reliability of large language models (LLMs) and multimodal LLMs (MLLMs) as evaluators for web development tasks. To this end, we introduce WebDevJudge, a systematic benchmark for evaluating LLM/MLLM-based judges that supports both static (non-interactive) and dynamic (interactive) assessment across functional correctness and user experience dimensions, with structured scoring rubrics and human preference annotations. Methodologically, the framework integrates MLLMs, agentic workflows, static code analysis, and dynamic execution in realistic browser environments to enable end-to-end automated evaluation. Extensive experiments reveal fundamental limitations of current LLM evaluators in recognizing functional equivalence, verifying task feasibility, and mitigating bias, exposing a significant performance gap relative to human experts. This work establishes an empirical foundation and a scalable technical framework for automated web development assessment.
📝 Abstract
The paradigm of LLM-as-a-judge is emerging as a scalable and efficient alternative to human evaluation, demonstrating strong performance on well-defined tasks. However, its reliability in open-ended tasks with dynamic environments and complex interactions remains unexplored. To bridge the gap, we introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development, with support for both non-interactive evaluation based on static observations and continuous interactive evaluation with a dynamic web environment. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics to ensure high-quality ground truth. Using this benchmark, we comprehensively evaluate various evaluators, including LLMs, MLLMs, and agentic workflows. We systematically investigate the impact of different paradigms and guidance mechanisms. Our experiments reveal a significant gap between LLM judges and human experts. In-depth analysis indicates this gap stems from fundamental model limitations, including failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias. Overall, WebDevJudge presents a significant challenge to LLM-as-a-judge, offering insights to guide future research toward developing more reliable and capable automated evaluators for complicated scenarios. Code and data are available at https://github.com/lcy2723/WebDevJudge.
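Since the benchmark grades judges by how often their pairwise verdicts match the human preference labels, the core metric can be sketched as a simple agreement rate. This is a minimal illustrative sketch; the function name, label encoding ("A"/"B"/"tie"), and data layout are assumptions, not the repository's actual API.

```python
# Hypothetical sketch of the core evaluation metric: agreement between an
# LLM judge's pairwise preferences and human preference labels over paired
# web implementations. Label values "A", "B", and "tie" are assumed here.

def judge_agreement(human_labels, judge_labels):
    """Fraction of paired comparisons where the judge's verdict matches
    the human annotators' preference label."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# Toy example: the judge agrees with humans on 3 of 4 pairs.
human = ["A", "B", "tie", "A"]
judge = ["A", "B", "B", "A"]
print(judge_agreement(human, judge))  # → 0.75
```

In practice one would also report agreement broken down by evaluation paradigm (static observation vs. interactive probing) and check for positional bias, e.g. by swapping the A/B order of each pair and re-querying the judge.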