WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a realistic, reference-free, and interpretable benchmark for evaluating web application generation systems driven by genuine user needs. We present the first general-purpose evaluation benchmark based on 1,572 diverse real-world requirements, along with a novel reference-free automated assessment framework that integrates rule-based checks and LLM-as-a-judge methodologies. To enhance alignment with human judgment, we introduce a preference-weighted scoring mechanism, resulting in an interpretable evaluation system comprising nine dimensions and 24 fine-grained metrics. Extensive experiments across 12 leading large language models and two agent-based systems reveal that no single model consistently outperforms the others across all criteria, thereby offering clear guidance for targeted improvements in this domain.

📝 Abstract
Web applications (web apps) have become a key arena for large language models (LLMs) to demonstrate their code generation capabilities and commercial potential. However, building a benchmark for LLM-generated web apps remains challenging due to the need for real-world user requirements, generalizable evaluation metrics that do not rely on ground-truth implementations or test cases, and interpretable evaluation results. To address these challenges, we introduce WebCoderBench, the first real-world-collected, generalizable, and interpretable benchmark for web app generation. WebCoderBench comprises 1,572 real user requirements, covering diverse modalities and expression styles that reflect realistic user intentions. It provides 24 fine-grained evaluation metrics across 9 perspectives, combining rule-based and LLM-as-a-judge paradigms for fully automated, objective, and general evaluation. Moreover, WebCoderBench adopts human-preference-aligned weights over metrics to yield interpretable overall scores. Experiments across 12 representative LLMs and 2 LLM-based agents show that no single model dominates across all evaluation metrics, offering LLM developers an opportunity for targeted optimization of their models.
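The abstract describes aggregating rule-based checks and LLM-as-a-judge ratings under human-preference-aligned weights into a single interpretable overall score. The sketch below illustrates that aggregation idea only; the metric names, weights, and score values are hypothetical placeholders, not identifiers or numbers taken from the paper.

```python
# Minimal sketch (not the paper's implementation) of a preference-weighted
# overall score aggregated from per-metric results. All metric names, weights,
# and scores below are hypothetical illustrations.

# Per-metric scores in [0, 1]: some from rule-based checks, some from an LLM judge.
metric_scores = {
    "page_renders_without_errors": 1.0,   # rule-based check (hypothetical)
    "required_features_present":   0.8,   # rule-based check (hypothetical)
    "layout_quality":              0.7,   # LLM-as-a-judge rating (hypothetical)
    "instruction_adherence":       0.9,   # LLM-as-a-judge rating (hypothetical)
}

# Human-preference-aligned weights (hypothetical values); the paper derives
# such weights so that the overall score tracks human judgment.
preference_weights = {
    "page_renders_without_errors": 0.35,
    "required_features_present":   0.30,
    "layout_quality":              0.15,
    "instruction_adherence":       0.20,
}

def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of metric scores, normalized by the total weight used."""
    total_weight = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total_weight

print(f"Overall score: {overall_score(metric_scores, preference_weights):.3f}")
```

The normalization by the total weight simply keeps the overall score in [0, 1] even when only a subset of the 24 metrics is available for a given submission; the actual benchmark's metric set, weighting procedure, and score scales are defined in the paper.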
Problem

Research questions and friction points this paper is trying to address.

web application generation
large language models
benchmark
evaluation metrics
interpretable evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

WebCoderBench
web application generation
LLM-as-a-judge
interpretable evaluation
real-world requirements