XBIDetective: Leveraging Vision Language Models for Identifying Cross-Browser Visual Inconsistencies

📅 2025-12-16

🤖 AI Summary

This paper addresses cross-browser inconsistencies (XBIs), particularly those involving dynamic interactive elements and advertisements, which conventional DOM- or image-based comparison techniques often struggle to handle. It presents XBIDetective, a tool that automatically captures screenshots of a website in Firefox and Chrome and analyzes the pair with a vision language model (VLM) to identify XBIs. The tool is evaluated with both an off-the-shelf and a fine-tuned VLM on 1,052 websites. With the fine-tuned VLM, it identifies cross-browser discrepancies with 79% accuracy and detects dynamic elements and advertisements with 84% and 85% accuracy, respectively. Key contributions include: (1) moving beyond static DOM and image comparison to account for dynamic and interactive content; (2) practical use cases such as automated regression testing, large-scale monitoring of websites, and rapid triaging of XBI bug reports; and (3) lessons learned from applying VLMs to XBI detection in an industry setting.

📝 Abstract

Browser rendering bugs can be challenging to detect for browser developers, as they may be triggered by very specific conditions that are exhibited on only a very small subset of websites. Cross-browser inconsistencies (XBIs), variations in how a website is interpreted and displayed on different browsers, can be helpful guides to detect such rendering bugs. Although visual and Document Object Model (DOM)-based analysis techniques exist for detecting XBIs, they often struggle with dynamic and interactive elements. In this study, we discuss our industry experience with using vision language models (VLMs) to identify XBIs. We present the XBIDetective tool which automatically captures screenshots of a website in Mozilla Firefox and Google Chrome, and analyzes them with a VLM for XBIs. We evaluate XBIDetective's performance with an off-the-shelf and a fine-tuned VLM on 1,052 websites. We show that XBIDetective can identify cross-browser discrepancies with 79% accuracy and detect dynamic elements and advertisements with 84% and 85% accuracy, respectively, when using the fine-tuned VLM. We discuss important lessons learned, and we present several potential practical use cases for XBIDetective, including automated regression testing, large-scale monitoring of websites, and rapid triaging of XBI bug reports.
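The accuracy figures quoted above are, in the usual sense, the fraction of evaluated websites for which the tool's answer matches a reference label. A minimal sketch of that arithmetic follows; the `Verdict` record, its field names, and the sample data are hypothetical and are not taken from the paper's evaluation harness.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Verdict:
    """One evaluated website (hypothetical record layout)."""
    url: str
    label: bool      # reference answer: does the site show an XBI?
    predicted: bool  # the tool's answer for the same site


def accuracy(verdicts: Sequence[Verdict]) -> float:
    """Return the fraction of verdicts whose prediction matches the label."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    correct = sum(v.label == v.predicted for v in verdicts)
    return correct / len(verdicts)


# Made-up sample data, for illustration only.
sample = [
    Verdict("https://a.example", label=True, predicted=True),
    Verdict("https://b.example", label=False, predicted=False),
    Verdict("https://c.example", label=True, predicted=False),
    Verdict("https://d.example", label=False, predicted=False),
]

print(f"{accuracy(sample):.0%}")  # 3 of 4 match -> 75%
```

The same function could be applied separately to the dynamic-element and advertisement labels to obtain per-category figures.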

Problem

Research questions and friction points this paper is trying to address.

Detecting cross-browser visual inconsistencies in website rendering

Handling dynamic and interactive elements, which cause display variations that existing visual and DOM-based techniques struggle with

Automating visual bug detection for regression testing and monitoring

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses vision language models to detect cross-browser inconsistencies

Automates screenshot capture and analysis across Firefox and Chrome

Fine-tunes a VLM, reaching 84% accuracy on dynamic elements and 85% on advertisements
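The capture-then-analyze flow listed above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the `capture` and `analyze` callables, the `XbiReport` fields, and the flag names (`xbi`, `dynamic`, `ads`) are hypothetical, and the stubs stand in for real browser automation and a real VLM call, neither of which is specified in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical interfaces: a real tool would back these with browser
# automation and a VLM request; here they are injected so the sketch runs.
Capture = Callable[[str, str], bytes]                # (browser, url) -> screenshot bytes
Analyze = Callable[[bytes, bytes], Dict[str, bool]]  # (firefox, chrome) -> flags


@dataclass(frozen=True)
class XbiReport:
    """Result for one website (field names are assumptions)."""
    url: str
    has_xbi: bool
    has_dynamic_elements: bool
    has_advertisements: bool


def check_site(url: str, capture: Capture, analyze: Analyze) -> XbiReport:
    """Screenshot the URL in both browsers, then ask the analyzer to compare."""
    firefox_png = capture("firefox", url)
    chrome_png = capture("chrome", url)
    flags = analyze(firefox_png, chrome_png)
    return XbiReport(
        url=url,
        has_xbi=bool(flags.get("xbi", False)),
        has_dynamic_elements=bool(flags.get("dynamic", False)),
        has_advertisements=bool(flags.get("ads", False)),
    )


def fake_capture(browser: str, url: str) -> bytes:
    """Stub: returns placeholder bytes that differ per browser."""
    return f"{browser}:{url}".encode()


def fake_analyze(firefox_png: bytes, chrome_png: bytes) -> Dict[str, bool]:
    """Stub: flags an XBI whenever the two captures differ byte-for-byte."""
    return {"xbi": firefox_png != chrome_png, "dynamic": False, "ads": False}


report = check_site("https://example.com", fake_capture, fake_analyze)
print(report)
```

Passing the capture and analysis steps in as functions keeps the comparison logic independent of any particular browser driver or model, so either can be swapped without touching `check_site`.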
