Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

📅 2026-05-13

🤖 AI Summary
This work addresses a limitation of existing GUI critic models: because they rely on binary classification, they lack fine-grained ranking ability and struggle to separate valid actions from semantically similar but invalid distractors. The paper reframes GUI critique as a continuous semantic alignment problem grounded in the “Functional Equivalence Hypothesis.” It introduces BBCritic, which requires no additional annotations and uses two-stage contrastive learning to align instructions and actions in a shared affordance space, recovering the hierarchical structure that binary supervision flattens. The authors also present BBBench, described as the first fine-grained benchmark for GUI critics, which pairs a dense action space with a four-level action taxonomy. Experiments show that a 3B-parameter BBCritic model outperforms 7B-parameter state-of-the-art binary critics and transfers zero-shot across platforms and tasks.
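The summary does not spell out the two-stage training objective, so the following is only a minimal sketch of the general idea of contrastive alignment, using the standard InfoNCE loss as a stand-in. The function names, the temperature value, and the two-dimensional toy embeddings are all assumptions made for illustration and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(instruction, positive, negatives, temperature=0.1):
    """InfoNCE loss for one instruction embedding (illustrative stand-in,
    not the paper's objective): low when the positive action is closer to
    the instruction than every negative action."""
    logits = [cosine(instruction, positive) / temperature]
    logits += [cosine(instruction, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidate logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Negative log-probability assigned to the positive action.
    return log_sum - logits[0]

# Hypothetical toy embeddings in a 2-D "affordance space".
instruction = [1.0, 0.0]
valid_action = [0.9, 0.1]
distractors = [[0.0, 1.0], [-1.0, 0.0]]

loss_aligned = info_nce(instruction, valid_action, distractors)
loss_misaligned = info_nce(
    instruction, distractors[0], [valid_action, distractors[1]]
)
print(loss_aligned < loss_misaligned)
```

Unlike a 0/1 label, the loss depends on how far each distractor sits from the instruction, which is the kind of graded signal the summary attributes to the continuous formulation.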
📝 Abstract
Test-Time Scaling (TTS), which samples multiple candidate actions and ranks them via a Critic Model, has emerged as a promising paradigm for generalist GUI agents. Its efficacy thus hinges on the critic's fine-grained ranking ability. However, existing GUI critic models uniformly adopt binary classification. Our motivational analysis of these models exposes a severe entanglement: scores for valid actions and plausible-but-invalid distractors become indistinguishable. We attribute this failure to two structural defects: Affordance Collapse--the hierarchical affordance space is compressed into 0/1 labels; and Noise Sensitivity--binary objectives overfit to noisy decision boundaries. To resolve this, we introduce BBCritic (Beyond-Binary Critic), a paradigm shift grounded in the Functional Equivalence Hypothesis. Through two-stage contrastive learning, BBCritic aligns instructions and actions in a shared Affordance Space, recovering the hierarchical structure that binary supervision flattens. We also present BBBench (Beyond-Binary Bench), the first GUI critic benchmark that pairs a dense action space with a hierarchical four-level taxonomy, enabling fine-grained ranking evaluation. Experimental results show that BBCritic-3B, trained without any extra annotation, outperforms 7B-parameter SOTA binary models. It demonstrates strong zero-shot transferability across platforms and tasks, supporting our methodological view: GUI critique is fundamentally a metric-learning problem, not a classification one.
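The Test-Time Scaling loop described in the abstract (sample several candidate actions, score each with a critic, pick the top one) can be sketched as follows. This is a hedged illustration only: the critic here is plain cosine similarity between hand-written placeholder embeddings, and the action names, vectors, and `rank_candidates` helper are hypothetical rather than part of BBCritic.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(instruction_vec, candidates):
    """Return (score, action_name) pairs sorted from best to worst.

    `candidates` maps a hypothetical action name to its embedding; a real
    critic would produce these embeddings from screenshots and actions.
    """
    scored = [
        (cosine(instruction_vec, vec), name)
        for name, vec in candidates.items()
    ]
    return sorted(scored, reverse=True)

# Placeholder embeddings: one valid action, one plausible distractor,
# and one clearly irrelevant action.
instruction_vec = [1.0, 0.2]
candidates = {
    "click_submit": [0.9, 0.3],
    "click_cancel": [0.4, 0.9],
    "scroll_down": [-0.5, 0.5],
}

ranking = rank_candidates(instruction_vec, candidates)
best_action = ranking[0][1]
for score, name in ranking:
    print(f"{name}: {score:.3f}")
```

Because the scores are continuous, the plausible distractor lands between the valid action and the irrelevant one instead of collapsing onto either end, which is the ordering a binary critic cannot express.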
Problem

Research questions and friction points this paper is trying to address.

GUI Critic
Binary Classification
Fine-grained Ranking
Affordance Space
Test-Time Scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Beyond-Binary Critic
Affordance Space
Contrastive Learning
Test-Time Scaling
Functional Equivalence Hypothesis