🤖 AI Summary
Existing web agents degrade significantly when web interfaces evolve over time, largely because evaluation frameworks and training methods have not accounted for UI variation across eras. To address this gap, this work introduces TimeWarp, a benchmark that uses containerized environments to reconstruct three websites, each in six UI versions spanning different eras of the internet. It also proposes TimeTraj, an algorithm that uses plan distillation to collect action trajectories across multiple UI versions, in place of conventional behavior cloning on single-version trajectories. Experiments show that the approach substantially improves cross-UI generalization: success rates rise from 20.4% to 37.7% for Qwen-3 4B and from 0% to 27.0% for Llama-3.1 8B. The work marks the first systematic effort to evaluate and improve the robustness of web agents across evolving historical interfaces.
📝 Abstract
The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes? We introduce TimeWarp, a benchmark that emulates the evolving web using containerized environments that vary in UI, design, and layout. TimeWarp consists of three web environments, each with six UI versions spanning different eras of the internet, paired with a set of complex, realistic tasks requiring different forms of web navigation. Our experiments reveal web agents' vulnerability to changes and the limitations of behavior cloning (BC) on single-version trajectories. To address this, we propose TimeTraj, a simple yet effective algorithm that uses plan distillation to collect trajectories across multiple versions. By training agents on teacher rollouts using our BC variant, we achieve substantial performance gains: $20.4\% \rightarrow 37.7\%$ for Qwen-3 4B and $0\% \rightarrow 27.0\%$ for Llama-3.1 8B models. We hope our work helps researchers study generalization across web designs and unlock a new paradigm for collecting plans rather than trajectories, thereby improving the robustness of web agents.
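The idea of collecting plans rather than trajectories can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify how plans are represented or executed, so every name here (`Trajectory`, `collect_multi_version`, `bc_dataset`, `TOY_UI`) is hypothetical, and keeping only successful rollouts is an assumed filtering choice. The sketch takes one version-agnostic plan, grounds it separately in each UI version, and turns the resulting rollouts into behavior-cloning examples.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy model: each UI version maps a plan step to a concrete action.
# A missing entry means the step cannot be grounded in that version.
TOY_UI: Dict[str, Dict[str, str]] = {
    "v1": {"open search": "click(link='Search')", "submit query": "click(button='Go')"},
    "v2": {"open search": "click(icon='magnifier')", "submit query": "press('Enter')"},
    "v3": {"open search": "click(icon='magnifier')"},
}


@dataclass(frozen=True)
class Trajectory:
    task: str
    version: str
    steps: Tuple[Tuple[str, str], ...]  # (plan step, grounded UI action)
    success: bool


def toy_ground(version: str, step: str) -> Optional[str]:
    """Return the concrete action for a plan step, or None if it cannot be grounded."""
    return TOY_UI[version].get(step)


def collect_multi_version(
    task: str,
    plan: List[str],
    versions: List[str],
    ground: Callable[[str, str], Optional[str]],
) -> List[Trajectory]:
    """Execute the same plan once per UI version and record one rollout per version."""
    rollouts = []
    for version in versions:
        steps = []
        success = True
        for step in plan:
            action = ground(version, step)
            if action is None:  # stop the rollout at the first ungroundable step
                success = False
                break
            steps.append((step, action))
        rollouts.append(Trajectory(task, version, tuple(steps), success))
    return rollouts


def bc_dataset(rollouts: List[Trajectory]) -> List[Tuple[str, str, str, str]]:
    """Flatten successful rollouts into (task, version, step, action) examples.

    Dropping failed rollouts is an assumption made for this sketch.
    """
    return [
        (t.task, t.version, step, action)
        for t in rollouts
        if t.success
        for step, action in t.steps
    ]


plan = ["open search", "submit query"]
trajs = collect_multi_version("find article", plan, list(TOY_UI), toy_ground)
examples = bc_dataset(trajs)
```

In this toy setting the single plan yields two successful rollouts (`v1`, `v2`) with different concrete actions for the same steps, which is the kind of cross-version variation that single-version behavior cloning never sees.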