🤖 AI Summary
Existing multimodal large language model (MLLM) benchmarks focus narrowly on vision-to-code translation, neglecting comprehensive front-end engineering capabilities across the full development pipeline. Method: We propose FullFront—the first end-to-end benchmark covering three sequential stages: design ideation, visual understanding, and code implementation. It introduces a novel two-stage webpage cleaning pipeline that preserves real-world design diversity while enforcing coding standards; decomposes front-end workflows into quantifiable, reproducible collaborative tasks; and integrates webpage structure analysis, cross-modal semantic alignment, and standardized HTML reconstruction. Results: Experiments reveal substantial performance gaps between state-of-the-art MLLMs and human developers—particularly in page perception and code generation—where image embedding and responsive layout errors exceed 70%. FullFront is publicly released, establishing a systematic, scientifically grounded evaluation paradigm for MLLM front-end proficiency.
📝 Abstract
Front-end engineering involves a complex workflow where engineers conceptualize designs, translate them into code, and iteratively refine the implementation. While recent benchmarks primarily focus on converting visual designs to code, we present FullFront, a benchmark designed to evaluate Multimodal Large Language Models (MLLMs) **across the full front-end development pipeline**. FullFront assesses three fundamental tasks that map directly to the front-end engineering pipeline: Webpage Design (conceptualization phase), Webpage Perception QA (comprehension of visual organization and elements), and Webpage Code Generation (implementation phase). Unlike existing benchmarks that use either scraped websites with bloated code or oversimplified LLM-generated HTML, FullFront employs a novel, two-stage process to transform real-world webpages into clean, standardized HTML while maintaining diverse visual designs and avoiding copyright issues. Extensive testing of state-of-the-art MLLMs reveals significant limitations in page perception, code generation (particularly for image handling and layout), and interaction implementation. Our results quantitatively demonstrate performance disparities across models and tasks, and highlight a substantial gap between current MLLM capabilities and human expert performance in front-end engineering. The FullFront benchmark and code are available at https://github.com/Mikivishy/FullFront.