Exploring the Capabilities of Vision-Language Models to Detect Visual Bugs in HTML5 `<canvas>` Applications

📅 2025-01-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the detection of visual bugs (layout, rendering, appearance, and state defects) in HTML5 `<canvas>` applications, where conventional DOM-based inspection is infeasible because rendered objects are not represented in the DOM. We explore a vision-language model (VLM)-based detection approach that does not rely on an exact pixel- or feature-level visual test oracle. Instead, it builds semantic context from the application README, a description of the visual bug types, and a bug-free reference screenshot, and passes that context to the VLM together with the screenshot under test. On a dataset of 80 bug-injected and 20 bug-free screenshots from 20 applications, the approach reaches up to 100% per-application accuracy with a state-of-the-art VLM. The core contribution is applying VLMs to visual testing of `<canvas>` applications whose graphics are procedurally generated, so that no asset-based ground truth is required.
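As a rough illustration of how the context pieces mentioned above might be combined, the following sketch assembles a text prompt and an ordered list of images for a multimodal request. It is not the authors' implementation: `VisualTestCase`, `build_prompt`, the prompt wording, and the file paths are all hypothetical, and no particular VLM API is assumed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# The four bug type names come from the abstract; everything else here is
# an illustrative assumption.
BUG_TYPES = ("Layout", "Rendering", "Appearance", "State")


@dataclass
class VisualTestCase:
    app_name: str
    screenshot_path: str                              # screenshot under test
    readme_text: Optional[str] = None                 # optional text context
    reference_screenshot_path: Optional[str] = None   # optional bug-free image


def build_prompt(case: VisualTestCase,
                 include_bug_types: bool = True) -> Tuple[str, List[str]]:
    """Return (prompt_text, image_paths) for a hypothetical multimodal request."""
    parts = [f"You are inspecting a screenshot of the application '{case.app_name}'."]
    if case.readme_text:
        parts.append("Application README:\n" + case.readme_text)
    if include_bug_types:
        parts.append("Possible visual bug types: " + ", ".join(BUG_TYPES) + ".")

    images: List[str] = []
    if case.reference_screenshot_path:
        parts.append("The first image is a bug-free reference; "
                     "the last image is the screenshot under test.")
        images.append(case.reference_screenshot_path)
    images.append(case.screenshot_path)

    parts.append("Does the screenshot under test contain a visual bug? "
                 "Answer 'yes' or 'no'.")
    return "\n\n".join(parts), images
```

Toggling `readme_text`, `include_bug_types`, and `reference_screenshot_path` yields different context combinations, which is one plausible way to organize experiments that compare text and image context.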

📝 Abstract

The HyperText Markup Language 5 (HTML5) `<canvas>` is useful for creating visual-centric web applications. However, unlike traditional web applications, HTML5 `<canvas>` applications render objects onto the `<canvas>` bitmap without representing them in the Document Object Model (DOM). Mismatches between the expected and actual visual output of the `<canvas>` bitmap are termed visual bugs. Due to the visual-centric nature of `<canvas>` applications, visual bugs are important to detect because such bugs can render a `<canvas>` application useless. As we showed in prior work, Asset-Based graphics can provide the ground truth for a visual test oracle. However, many `<canvas>` applications procedurally generate their graphics. In this paper, we investigate how to detect visual bugs in `<canvas>` applications that use Procedural graphics as well. In particular, we explore the potential of Vision-Language Models (VLMs) to automatically detect visual bugs. Instead of defining an exact visual test oracle, information about the application's expected functionality (the context) can be provided with the screenshot as input to the VLM. To evaluate this approach, we constructed a dataset containing 80 bug-injected screenshots across four visual bug types (Layout, Rendering, Appearance, and State) plus 20 bug-free screenshots from 20 `<canvas>` applications. We ran experiments with a state-of-the-art VLM using several combinations of text and image context to describe each application's expected functionality. Our results show that by providing the application README(s), a description of visual bug types, and a bug-free screenshot as context, VLMs can be leveraged to detect visual bugs with up to 100% per-application accuracy.
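The abstract does not define "per-application accuracy" precisely. One plausible reading, assumed here, is the fraction of an application's screenshots (bug-injected and bug-free) that the VLM classifies correctly. The sketch below computes that quantity; the function name, the tuple layout, and the sample data are hypothetical and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_application_accuracy(
    results: Iterable[Tuple[str, bool, bool]]
) -> Dict[str, float]:
    """Compute accuracy per application.

    Each result is (app_name, truly_has_bug, predicted_has_bug).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for app, truth, predicted in results:
        total[app] += 1
        correct[app] += int(truth == predicted)
    return {app: correct[app] / total[app] for app in total}


# Made-up example data for two hypothetical applications.
example_results = [
    ("app_a", True, True), ("app_a", True, True), ("app_a", False, False),
    ("app_b", True, False), ("app_b", False, False),
]
scores = per_application_accuracy(example_results)
```

Under this reading, "up to 100%" would mean that at least one application had every screenshot classified correctly under the best context combination.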

Problem

Research questions and friction points this paper is trying to address.

HTML5 Canvas

Visual Error

Automatic Detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Language Model

Dynamic Graphics Error Detection

HTML5 Canvas Applications
