InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

📅 2026-04-30
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses blind execution: multimodal agents carrying out ambiguous, redundant, or contradictory instructions from non-expert users in low-code settings without questioning them. It introduces the first benchmark for multimodal interactive website generation aimed at non-professional users. The benchmark simulates user behavior through role-driven instruction perturbations grounded in requirements engineering defect taxonomies, producing diverse and realistic user requests. It defines a unified action space of clarification, implementation, verification, and submission, which enables iterative intent refinement and validation against visual feedback. Experiments show that state-of-the-art multimodal agents commonly fall into blind execution, exposing critical deficiencies in intent understanding and adaptive interaction. The benchmark thereby supports more dynamic, realistic evaluation of interactive AI systems.
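The three defect types named above (ambiguity, redundancy, contradiction) can be illustrated with a toy perturbation function. This is a minimal sketch, not the benchmark's actual mechanism: the function name `perturb`, the list-of-strings requirement format, and the specific rewrite rules are all assumptions made for illustration.

```python
def perturb(requirements: list[str], defect: str) -> list[str]:
    """Return a degraded copy of a clean requirement list.

    Hypothetical illustration only: the rewrite rules below are
    assumptions, not the perturbations used by InteractWeb-Bench.
    """
    if not requirements:
        raise ValueError("need at least one requirement to perturb")

    first = requirements[0]
    if defect == "ambiguity":
        # Replace a concrete requirement with a vague one.
        return ["make it look nice"] + requirements[1:]
    if defect == "redundancy":
        # Restate a requirement that was already given.
        return requirements + [first]
    if defect == "contradiction":
        # Append a requirement that negates an earlier one.
        return requirements + ["do not " + first]
    raise ValueError(f"unknown defect type: {defect}")


clean = ["use a dark theme", "add a contact form"]
for kind in ("ambiguity", "redundancy", "contradiction"):
    print(kind, "->", perturb(clean, kind))
```

An agent that implements the contradictory list as given, without asking which requirement should win, would be exhibiting the blind execution the benchmark is designed to expose.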
📝 Abstract
With the advancement of multimodal large language models (MLLMs) and coding agents, website development has shifted from manual programming to agent-based, project-level code synthesis. Existing benchmarks rely on idealized assumptions, especially well-structured, information-rich inputs and static execution settings. In contrast, real-world development is constrained by a critical bottleneck: the semantic misalignment between ambiguous, low-quality instructions from non-expert users and model understanding, which results in a failure mode that we term blind execution. To address this gap, we introduce InteractWeb-Bench, the first multimodal interactive benchmark for website generation under non-expert low-code user conditions. InteractWeb-Bench introduces four types of user agents and persona-driven instruction perturbations to systematically simulate diverse user behaviors, including ambiguity, redundancy, and contradiction, grounded in requirements engineering defect taxonomies. We develop an interactive execution environment for agents, featuring a unified action space comprising Clarify, Implement, Verify, and Submit, which enables iterative intent refinement, code synthesis, and visual feedback-based validation. Extensive experiments and analysis reveal that frontier MLLM-based agents remain trapped in blind execution, exposing limitations in intent recognition and adaptive interaction.
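The unified action space can be pictured as a small decision loop. The sketch below is an assumption-laden illustration, not the benchmark's environment: only the four action names come from the abstract, while the `Episode` fields, the `ambiguous` flag, and the fixed policy in `choose_action` are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    # The four actions named in the abstract.
    CLARIFY = "clarify"
    IMPLEMENT = "implement"
    VERIFY = "verify"
    SUBMIT = "submit"


@dataclass
class Episode:
    # All fields are hypothetical bookkeeping for this sketch.
    instruction: str
    ambiguous: bool
    clarified: bool = False
    implemented: bool = False
    verified: bool = False
    history: list = field(default_factory=list)


def choose_action(ep: Episode) -> Action:
    """Toy policy: ask before building, check before submitting."""
    if ep.ambiguous and not ep.clarified:
        return Action.CLARIFY
    if not ep.implemented:
        return Action.IMPLEMENT
    if not ep.verified:
        return Action.VERIFY
    return Action.SUBMIT


def run_episode(ep: Episode, max_steps: int = 10) -> list:
    """Apply the policy until Submit is chosen or the step budget runs out."""
    for _ in range(max_steps):
        action = choose_action(ep)
        ep.history.append(action)
        if action is Action.CLARIFY:
            ep.clarified = True
        elif action is Action.IMPLEMENT:
            ep.implemented = True
        elif action is Action.VERIFY:
            ep.verified = True
        else:
            break  # Submit ends the episode.
    return ep.history


trace = run_episode(Episode("make me a shop website", ambiguous=True))
print([a.value for a in trace])
```

In these terms, a blindly executing agent would emit Implement followed directly by Submit on an ambiguous instruction, skipping both Clarify and Verify.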
Problem

Research questions and friction points this paper is trying to address.

blind execution
semantic misalignment
multimodal agent
interactive website generation
non-expert users
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal agent
interactive website generation
blind execution
intent refinement
low-code user simulation