FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing front-end code generation benchmarks suffer from oversimplified tasks, insufficient test rigor, and a lack of end-to-end validation, hindering accurate evaluation of LLMs’ capabilities. To address these limitations, we propose FrontendBench, the first end-to-end evaluation benchmark grounded in real-world front-end development practice. It comprises 148 prompt-test pairs spanning five levels of web components and introduces a human-in-the-loop, hierarchical functional task taxonomy. We design an automated assessment framework that integrates sandboxed execution with script-based verification, achieving 90.54% agreement with human evaluation. Extensive experiments across multiple state-of-the-art models show that FrontendBench offers strong discriminative power, high reliability, and good scalability, enabling rigorous multimodal evaluation of front-end code generation capabilities.

📝 Abstract
Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end validation is absent. These issues hinder the accurate assessment of model performance. To address these challenges, we present FrontendBench, a benchmark co-developed by humans and LLMs. FrontendBench categorizes tasks based on code functionality and incorporates interactive test scenarios, enabling a more comprehensive and practical evaluation of front-end code generation capabilities. The benchmark comprises 148 meticulously crafted prompt-test case pairs spanning five levels of web components, from basic UI elements to complex interactive features. Each task reflects realistic front-end development challenges. Furthermore, we introduce an automatic evaluation framework that executes generated code within a sandbox environment and assesses outcomes using predefined test scripts. This framework achieves a 90.54% agreement rate with expert human evaluations, demonstrating high reliability. We benchmark several state-of-the-art LLMs on FrontendBench and observe substantial performance disparities in handling real-world front-end tasks. These results highlight FrontendBench as a reliable and scalable benchmark, supporting consistent multimodal evaluation and providing a robust foundation for future research in front-end code generation. Our data and code will be released soon.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on realistic front-end development tasks
Addressing limitations in existing benchmarks for code generation
Providing automatic and reliable assessment of generated front-end code
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark co-developed by humans and LLMs
Automatic evaluation with sandbox environment
Interactive test scenarios for practical evaluation
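The automatic evaluation described above pairs each prompt with a test script, runs the generated code in a sandbox, and then measures how often the automatic verdict matches an expert's. A minimal sketch of the agreement computation is shown below; the `TaskResult` structure, task names, and verdicts are illustrative assumptions, not from the paper's released code.

```python
# Hypothetical sketch of a FrontendBench-style evaluation harness:
# each task yields an automatic pass/fail verdict (from a sandboxed test
# script) and a human pass/fail verdict; agreement is the fraction of
# tasks where the two verdicts coincide. All names/data are illustrative.
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    auto_pass: bool   # verdict from the sandboxed test script
    human_pass: bool  # verdict from an expert reviewer

def agreement_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks where the automatic verdict matches the human one."""
    if not results:
        return 0.0
    matches = sum(r.auto_pass == r.human_pass for r in results)
    return matches / len(results)

results = [
    TaskResult("button-hover", True, True),
    TaskResult("form-validate", False, False),
    TaskResult("modal-drag", True, False),   # one disagreement
    TaskResult("chart-render", False, False),
]
print(f"agreement: {agreement_rate(results):.2%}")  # → agreement: 75.00%
```

Over the paper's 148 prompt-test pairs, this metric is reported at 90.54%, i.e. the sandboxed scripts and human experts disagreed on roughly 14 tasks.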
Hongda Zhu
ByteDance, Chengdu, China
Yiwen Zhang
ByteDance, Beijing, China
Bing Zhao
SRI International
Jingzhe Ding
ByteDance, Beijing, China
Siyao Liu
ByteDance, Beijing, China
Tong Liu
ByteDance, Chengdu, China
Dandan Wang
ByteDance, Beijing, China
Yanan Liu
Lecturer at Shanghai University
Zhaojian Li
Red Cedar Distinguished Associate Professor, Michigan State University