From Imitation to Discrimination: Progressive Curriculum Learning for Robust Web Navigation

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of current text-based web agents: they struggle to reject distracting elements, and they generalize poorly to noisy, heterogeneous real-world HTML. The authors introduce the Triton dataset and a three-stage curriculum that starts with imitation learning, then adds discriminative training, and finally adds long-horizon consistency modeling. The dataset is built with a structural-semantic hard negative mining strategy and a dual-agent consensus synthesis pipeline. The three stages use supervised fine-tuning (SFT), odds ratio preference optimization (ORPO), and group relative policy optimization (GRPO), respectively, to train 32B-parameter models. The resulting Triton-GRPO-32B reaches a 58.7% step success rate on Mind2Web, ahead of the closed-source GPT-4.5 (42.4%) and Claude-4.5 (41.4%).
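To make the ORPO stage mentioned above concrete, here is a minimal sketch of the ORPO objective as it is generally defined: an SFT term on the preferred response plus a weighted odds-ratio penalty. It is not the authors' training code. The function names, the weight `lam`, and the example log-probabilities are illustrative assumptions, and the inputs are assumed to be length-normalized (average per-token) log-likelihoods.

```python
import math


def log_odds(avg_logprob: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logprob).

    avg_logprob is assumed to be a length-normalized log-likelihood,
    so it must be strictly negative for p to lie in (0, 1).
    """
    if avg_logprob >= 0.0:
        raise ValueError("avg_logprob must be negative")
    p = math.exp(avg_logprob)
    return avg_logprob - math.log1p(-p)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def orpo_loss(chosen_avg_logprob: float,
              rejected_avg_logprob: float,
              lam: float = 0.1) -> float:
    """SFT loss on the chosen response plus lam times the odds-ratio loss.

    lam is a hypothetical weighting; the summary does not state the value used.
    """
    sft_term = -chosen_avg_logprob
    log_odds_ratio = log_odds(chosen_avg_logprob) - log_odds(rejected_avg_logprob)
    # -log(sigmoid(z)) equals softplus(-z)
    odds_ratio_term = softplus(-log_odds_ratio)
    return sft_term + lam * odds_ratio_term


# Hypothetical example: the correct element is more likely than a distractor.
good = orpo_loss(chosen_avg_logprob=-0.2, rejected_avg_logprob=-1.5)
bad = orpo_loss(chosen_avg_logprob=-1.5, rejected_avg_logprob=-0.2)
print(good < bad)
```

In a web-navigation setting, the "rejected" response would presumably be an action on a hard-negative element, so the penalty grows when the model assigns the distractor odds comparable to the correct element.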

📝 Abstract

Text-based web agents offer computational efficiency for autonomous web navigation, yet developing robust agents remains challenging due to the noisy and heterogeneous nature of real-world HTML. Standard Supervised Fine-Tuning (SFT) approaches fail in two critical dimensions: they lack the discrimination capability to reject plausible but incorrect elements in densely populated pages, and they generalize poorly to unseen website layouts. To address these challenges, we introduce the Triton dataset (590k instances) and a progressive training curriculum. Triton is constructed via Structural-Semantic Hard Negative Mining, which explicitly mines topologically similar distractors, and a Dual-Agent Consensus pipeline that synthesizes diverse cross-domain tasks with strict verification. Building upon this foundation, our progressive curriculum produces three models: Triton-SFT-32B for basic imitation, Triton-ORPO-32B for robust discrimination via Odds Ratio Preference Optimization, and Triton-GRPO-32B for long-horizon consistency through Group Relative Policy Optimization. Empirical evaluation on Mind2Web demonstrates that Triton-GRPO-32B achieves state-of-the-art performance among open-source models with a 58.7% Step Success Rate, surpassing GPT-4.5 (42.4%) and Claude-4.5 (41.4%) by more than 16 percentage points, validating that a specialized data curriculum outweighs raw parameter scale for web navigation.
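The group-relative step that gives GRPO its name can be sketched in a few lines: several candidate actions are sampled for the same prompt, each is scored, and every reward is normalized against its own group instead of a learned value baseline. The sketch below is only an illustration of that normalization, not the paper's implementation. The binary reward (1.0 when a sampled action matches the reference element and operation), the use of the population standard deviation, and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and spread of its own group.

    rewards holds one scalar per sampled response to the same prompt.
    eps keeps the division defined when every reward in the group is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled actions for one navigation step:
# 1.0 means the action matched the reference, 0.0 means it did not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct samples receive positive advantages and incorrect ones negative advantages, while a group in which every sample earns the same reward contributes zero advantage and therefore no policy-gradient signal.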

Problem

Research questions and friction points this paper is trying to address.

web navigation

discrimination

generalization

HTML heterogeneity

robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive Curriculum Learning

Hard Negative Mining

Odds Ratio Preference Optimization (ORPO)