🤖 AI Summary
Existing web automation methods neglect historical page states, relying solely on the current state, past actions, and natural language instructions—leading to sparse information encoding and insufficient utilization of historical context under long-sequence inputs. To address this, we propose a historical state compression mechanism that systematically models and compresses verbose historical page states into fixed-length, task-relevant representations, integrated within a language model–based architecture. Our approach employs attention mechanisms to dynamically prioritize task-critical features from compressed histories. We evaluate the method on the Mind2Web and WebLINX benchmarks. Compared to a baseline omitting historical states, our method achieves absolute accuracy improvements of 1.2–5.4 percentage points across multiple metrics, demonstrating substantially enhanced cross-step contextual modeling capability.
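The core idea above — compressing a variable-length history page state into a fixed-length, task-relevant representation via attention — can be sketched with cross-attention pooling: a small set of learned query vectors attends over the token embeddings of each history state. This is a minimal illustrative sketch, not the paper's implementation; the function names, the number of queries `k`, and the use of plain dot-product attention are all assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_history_state(state_tokens, query_vectors):
    """Cross-attention pooling (illustrative sketch).

    A fixed number of learned query vectors attend over the
    variable-length token embeddings of one history page state,
    yielding a fixed-length summary regardless of page verbosity.

    state_tokens:  (seq_len, d) token embeddings of one history state
    query_vectors: (k, d)       k learned queries, with k << seq_len
    returns:       (k, d)       fixed-length compressed representation
    """
    d = state_tokens.shape[-1]
    scores = query_vectors @ state_tokens.T / np.sqrt(d)  # (k, seq_len)
    weights = softmax(scores, axis=-1)                    # rows sum to 1
    return weights @ state_tokens                         # (k, d)

rng = np.random.default_rng(0)
d, k = 16, 4
queries = rng.normal(size=(k, d))
# History states of very different lengths compress to the same shape,
# so each step contributes a bounded number of tokens to the input.
short_state = rng.normal(size=(50, d))
long_state = rng.normal(size=(900, d))
print(compress_history_state(short_state, queries).shape)  # (4, 16)
print(compress_history_state(long_state, queries).shape)   # (4, 16)
```

Because every history state is reduced to `k` vectors, the total input length grows linearly in the number of steps rather than in the (potentially huge) size of each page, which is what makes long histories tractable.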
📝 Abstract
Language models have led to a leap forward in web automation. Current web automation approaches take the current web state, the action history, and a language instruction as inputs to predict the next action, overlooking the importance of history states. The highly verbose nature of web page states, however, results in long input sequences and sparse information, hampering the effective utilization of history states. In this paper, we propose a novel web history compressor approach to turbocharge web automation using history states. Our approach employs a history compressor module that distills the most task-relevant information from each history state into a fixed-length short representation, mitigating the challenges posed by the highly verbose history states. Experiments are conducted on the Mind2Web and WebLINX datasets to evaluate the effectiveness of our approach. Results show that our approach obtains 1.2–5.4% absolute accuracy improvements compared to the baseline approach without history inputs.