🤖 AI Summary
This work addresses a limitation of existing GUI critic models: because they are trained as binary classifiers, they lack fine-grained ranking ability and struggle to separate valid actions from semantically similar but invalid distractors. The paper reframes GUI critique as a continuous semantic alignment problem grounded in the “Functional Equivalence Hypothesis.” It introduces BBCritic, which uses two-stage contrastive learning to align instructions and actions in a shared affordance space, recovering the hierarchical structure that binary supervision flattens. The authors also present BBBench, which they describe as the first GUI critic benchmark to pair a dense action space with a hierarchical four-level action taxonomy. In their experiments, a 3B-parameter BBCritic model trained without extra annotation outperforms 7B-parameter state-of-the-art binary classifiers and shows strong zero-shot transfer across platforms and tasks.
📝 Abstract
Test-Time Scaling (TTS), which samples multiple candidate actions and ranks them via a Critic Model, has emerged as a promising paradigm for generalist GUI agents. Its efficacy thus hinges on the critic's fine-grained ranking ability. However, existing GUI critic models uniformly adopt binary classification. Our motivational analysis of these models exposes a severe entanglement: scores for valid actions and plausible-but-invalid distractors become indistinguishable. We attribute this failure to two structural defects: Affordance Collapse, in which the hierarchical affordance space is compressed into 0/1 labels, and Noise Sensitivity, in which binary objectives overfit to noisy decision boundaries. To resolve this, we introduce BBCritic (Beyond-Binary Critic), a paradigm grounded in the Functional Equivalence Hypothesis. Through two-stage contrastive learning, BBCritic aligns instructions and actions in a shared Affordance Space, recovering the hierarchical structure that binary supervision flattens. We also present BBBench (Beyond-Binary Bench), the first GUI critic benchmark that pairs a dense action space with a hierarchical four-level taxonomy, enabling fine-grained ranking evaluation. Experimental results show that BBCritic-3B, trained without any extra annotation, outperforms 7B-parameter SOTA binary models. It also demonstrates strong zero-shot transferability across platforms and tasks, supporting our methodological view: GUI critique is fundamentally a metric-learning problem, not a classification one.
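To make the metric-learning framing concrete, the following is a minimal sketch of how a critic could rank sampled candidate actions by similarity to the instruction in a shared embedding space, together with a generic contrastive objective. It is not the authors' implementation: the abstract does not specify the encoder, the loss, or the two stages, so the InfoNCE-style loss, the temperature value, the function names, and the toy 2-D vectors below are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(instruction: Vector,
                    candidates: List[Vector]) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs sorted from best to worst.

    The score is continuous, so near-miss distractors can land between
    clearly valid and clearly invalid actions instead of being forced
    into a 0/1 label.
    """
    scored = [(i, cosine(instruction, c)) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def info_nce(instruction: Vector, positive: Vector,
             negatives: List[Vector], temperature: float = 0.1) -> float:
    """InfoNCE-style contrastive loss (an assumed stand-in objective).

    Lower values mean the instruction is closer to the positive action
    than to the negative ones.
    """
    logits = [cosine(instruction, positive) / temperature]
    logits += [cosine(instruction, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[0]


# Toy embeddings (hypothetical): one instruction and three sampled actions.
instruction = [1.0, 0.0]
candidates = [
    [0.0, 1.0],   # unrelated action
    [1.0, 0.1],   # well-aligned action
    [0.7, 0.7],   # plausible but weaker distractor
]

ranking = rank_candidates(instruction, candidates)
best_index = ranking[0][0]
```

In a test-time scaling loop, the agent would execute `candidates[best_index]`. During training, `info_nce` would be minimized so that valid actions score higher than distractors; how the paper splits this into two stages is not stated in the abstract.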