What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of current training-free GUI grounding methods: they rely on multiple inference passes, yet each pass interprets the instruction and parses the visual layout independently, so visual tokens never interact progressively, and an erroneous candidate selection cannot be corrected during decoding. We show that GUI grounding in vision-language models follows a two-stage paradigm: the prefill stage determines the candidate UI elements, and the decoding stage only refines the final coordinates, which makes prefill the performance bottleneck. To address this, we propose Re-Prefill, a training-free approach that adds an attention-guided second prefill stage. Visual tokens that receive consistently high attention from the query position (the final token) across layers are extracted as a preliminary target hypothesis and appended to the input together with the instruction hidden states, so the model can reconsider its decision before generating coordinates. The method improves performance consistently across four vision-language models and five benchmarks, with gains of up to 4.3% on ScreenSpot-Pro.

📝 Abstract

Existing training-free approaches for GUI grounding often rely on multiple inference runs, such as iterative cropping or candidate aggregation, to identify target elements. Despite this additional computation, each forward pass still independently interprets the instruction and parses the visual layout, without enabling progressive interaction among visual tokens. In this paper, we study what happens during GUI grounding in Vision-Language Models (VLMs) and identify a previously overlooked bottleneck. We show that grounding follows a two-stage paradigm: the prefill stage determines candidate UI elements, while the decoding stage subsequently refines the final coordinates. This asymmetry establishes prefill as the critical step, as errors in candidate selection cannot be effectively corrected during decoding. Based on this observation, we propose Re-Prefill, a training-free method that revisits inference by introducing an attention-guided second prefill stage to refine target selection. Specifically, visual tokens that consistently receive high attention from the query position, i.e., the final token, across layers are extracted as a preliminary target hypothesis and appended to the input, together with the instruction hidden states, enabling the model to deeply re-think its decision before coordinate generation. Experiments across four VLMs and five benchmarks, including ScreenSpot-Pro, ScreenSpot-V2, OSWorld-G, UI-Vision, and MMBench-GUI, demonstrate consistent improvements without additional training, with gains of up to 4.3% on ScreenSpot-Pro. Code will be available at https://github.com/linjiaping1/Re-Prefill.
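
The selection step described in the abstract (keeping visual tokens that consistently receive high attention from the final token across layers) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the exact selection rule, so the per-layer top-k vote, the `top_k` and `min_layer_fraction` parameters, and both function names are assumptions made purely for illustration. Attention weights are passed in as plain lists rather than read from a real model.

```python
from typing import List, Sequence


def select_consistent_tokens(
    attn_per_layer: Sequence[Sequence[float]],
    top_k: int = 2,
    min_layer_fraction: float = 0.5,
) -> List[int]:
    """Pick visual tokens that rank in the top-k by query attention
    in at least `min_layer_fraction` of the layers.

    attn_per_layer[l][i] is the attention from the query position
    (the final token) to visual token i at layer l.
    The voting rule is a hypothetical stand-in for the paper's criterion.
    """
    if not attn_per_layer:
        return []
    num_tokens = len(attn_per_layer[0])
    votes = [0] * num_tokens
    for layer_attn in attn_per_layer:
        if len(layer_attn) != num_tokens:
            raise ValueError("every layer must cover the same visual tokens")
        # Rank token indices by attention weight, highest first.
        ranked = sorted(range(num_tokens), key=lambda i: layer_attn[i], reverse=True)
        for idx in ranked[:top_k]:
            votes[idx] += 1
    needed = min_layer_fraction * len(attn_per_layer)
    return [i for i, v in enumerate(votes) if v >= needed]


def build_second_prefill_input(
    original_input: List[str],
    visual_tokens: List[str],
    selected: List[int],
    instruction_states: List[str],
) -> List[str]:
    """Append the selected visual tokens and the instruction hidden states
    to the original input (placeholder strings stand in for real states)."""
    hypothesis = [visual_tokens[i] for i in selected]
    return original_input + hypothesis + instruction_states


# Toy example: 3 layers, 4 visual tokens.
attn = [
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.50, 0.40, 0.05],
    [0.30, 0.40, 0.20, 0.10],
]
selected = select_consistent_tokens(attn, top_k=2, min_layer_fraction=0.5)
second_input = build_second_prefill_input(
    ["<img>", "<instr>"], ["v0", "v1", "v2", "v3"], selected, ["h_instr"]
)
```

In this toy input, token 1 ranks in the top two at every layer and token 2 at two of three layers, so both pass the 50% threshold; raising `min_layer_fraction` to 1.0 keeps only token 1. A stricter threshold yields a narrower hypothesis, which is the kind of trade-off any real selection rule would need to tune.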

Problem

Research questions and friction points this paper is trying to address.

GUI grounding

Vision-Language Models

prefill stage

candidate selection

visual tokens

Innovation

Methods, ideas, or system contributions that make the work stand out.

GUI grounding

Vision-Language Models

prefill stage