WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

Date: 2026-04-20

AI Summary
Existing evaluations of code language models focus largely on text-conditioned generation scored with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning in web development largely unmeasured. This work proposes WebCompass, a multimodal web coding benchmark that spans text, image, and video inputs across generation, editing, and repair tasks, mirroring real-world development workflows. Instances are curated through a multi-stage, human-in-the-loop pipeline and annotated at Easy/Medium/Hard levels. Editing and repair are scored with a checklist-guided LLM-as-a-Judge protocol, while generation uses an Agent-as-a-Judge paradigm that executes each generated website in a real browser, explores its interactive behavior via the Model Context Protocol (MCP), and iteratively synthesizes targeted test cases. Experiments show that closed-source models are substantially stronger than open-source ones; repair preserves interactivity better than editing but remains execution-challenging; aesthetics is the most persistent bottleneck, especially for open-source models; and framework choice materially affects outcomes: Vue is consistently challenging, while React and vanilla HTML/CSS/JS each perform more strongly depending on task type.
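The Agent-as-a-Judge idea summarized above (execute the site, explore it, synthesize tests, repeat) can be pictured as a simple loop. The sketch below is illustrative only and is not the authors' implementation: `judge_site`, `JudgeReport`, the three injected callables, the `max_rounds` cap, and the pass-rate score are all hypothetical names and choices, and the browser and MCP interactions are left as stubs supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class JudgeReport:
    """Outcome of one judging session (hypothetical structure)."""
    passed: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)

    @property
    def pass_rate(self) -> float:
        # Assumed score: fraction of synthesized tests that passed.
        total = len(self.passed) + len(self.failed)
        return len(self.passed) / total if total else 0.0


def judge_site(
    explore: Callable[[], List[str]],
    synthesize: Callable[[List[str], Set[str]], List[str]],
    run_test: Callable[[str], bool],
    max_rounds: int = 3,
) -> JudgeReport:
    """Explore the site, synthesize tests from what was seen, run them,
    and repeat until a round yields no new tests or max_rounds is hit."""
    report = JudgeReport()
    seen: Set[str] = set()
    for _ in range(max_rounds):
        # Stub for browser/MCP-driven exploration of the running site.
        observations = explore()
        added = False
        for test in synthesize(observations, seen):
            if test in seen:
                continue  # skip tests already executed in earlier rounds
            seen.add(test)
            added = True
            if run_test(test):
                report.passed.append(test)
            else:
                report.failed.append(test)
        if not added:
            break  # nothing new to check, so stop iterating
    return report
```

Passing the exploration, synthesis, and execution steps in as callables keeps the loop independent of any particular browser driver or model API, which is why no real library calls appear here.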

๐Ÿ“ Abstract
Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, yet existing benchmarks evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. Recognizing that real-world web coding is an iterative cycle of generation, editing, and repair, WebCompass spans three input modalities (text, image, video) and three task types (generation, editing, repair), yielding seven task categories that mirror professional workflows. Through a multi-stage, human-in-the-loop pipeline, we curate instances covering 15 generation domains, 16 editing operation types, and 11 repair defect types, each annotated at Easy/Medium/Hard levels. For evaluation, we adopt a checklist-guided LLM-as-a-Judge protocol for editing and repair, and propose a novel Agent-as-a-Judge paradigm for generation that autonomously executes generated websites in a real browser, explores interactive behaviors via the Model Context Protocol (MCP), and iteratively synthesizes targeted test cases, closely approximating human acceptance testing. We evaluate representative closed-source and open-source models and observe that: (1) closed-source models remain substantially stronger and more balanced; (2) editing and repair exhibit distinct difficulty profiles, with repair preserving interactivity better but remaining execution-challenging; (3) aesthetics is the most persistent bottleneck, especially for open-source models; and (4) framework choice materially affects outcomes, with Vue consistently challenging while React and Vanilla/HTML perform more strongly depending on task type.
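To make the benchmark's organization concrete, the following minimal sketch models one instance along the axes named in the abstract: input modality, task type, and difficulty level. It is an assumption-laden illustration, not the released data schema: the class names, the `instance_id` and `label` fields, and the `tally` helper are hypothetical. The abstract does not say which seven of the nine modality/task combinations are used, so the sketch deliberately does not encode that restriction.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"


class TaskType(Enum):
    GENERATION = "generation"
    EDITING = "editing"
    REPAIR = "repair"


class Difficulty(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


@dataclass(frozen=True)
class Instance:
    """One benchmark item (hypothetical schema)."""
    instance_id: str        # hypothetical identifier
    modality: Modality
    task_type: TaskType
    difficulty: Difficulty
    # Generation domain, editing operation type, or repair defect type,
    # kept as free text because the concrete label sets are not listed here.
    label: str


def tally(instances: Iterable[Instance]) -> Counter:
    """Count instances per (modality, task type) pair."""
    return Counter((i.modality, i.task_type) for i in instances)
```

A tally like this is one way to check how a curated set is distributed across categories before reporting per-category scores.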
Problem

Research questions and friction points this paper is trying to address.

multimodal evaluation
web coding
code language models
visual fidelity
interactive coding
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal evaluation
web coding benchmark
Agent-as-a-Judge
Model Context Protocol
lifecycle assessment
Authors

Xinping Lei
Nanjing University

Xinyu Che
Nanjing University

Junqi Xiong
Nanjing University

Chenchen Zhang
Nanjing University

Yukai Huang
Nanjing University

Chenyu Zhou
University of Southern California
Programming Languages, Program Verification, Program Analysis

Haoyang Huang
JD Explore Academy (present) | StepFun | Microsoft Research
Multimodal & Multilingual Foundation Model

Minghao Liu
Kuaishou Technology

Letian Zhu
Kuaishou Technology

Hongyi Ye
Kuaishou Technology

Jinhua Hao
Kuaishou Technology
Computer Vision, Generative AI, Fluid Mechanics

Ken Deng
Kwaipilot Team, Kuaishou Technology
LLM, AI4SE, AI Agent

Zizheng Zhan
Kuaishou Technology

Han Li
Kuaishou Technology

Dailin Li
Kuaishou Technology

Yifan Yao
Drexel University

Ming Sun
Kuaishou Tech
Object detection, Fine-grained, AutoML, Low-level

Zhaoxiang Zhang
Institute of Automation, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Biologically-inspired Learning

Jiaheng Liu
Kuaishou Technology