🤖 AI Summary
To address the bottleneck that information-seeking agents face in accessing deep-web content, this paper proposes a nested browser-operation framework that decouples action control from page exploration, enabling fine-grained, realistic web interactions. Unlike conventional API- or URL-level retrieval, the framework defines a minimal yet complete set of browser actions (e.g., click, scroll, form filling) and combines them with a ReAct-style function-calling architecture, hierarchical state abstraction, and dynamic content summarization. This design facilitates end-to-end controllable browsing. Evaluated on multiple deep-retrieval benchmarks, the method achieves substantial improvements in task completion rate (+18.3%) and answer accuracy (+22.7%). Results demonstrate its robustness, generalizability, and effectiveness in modeling complex, multi-step interactive scenarios, particularly those requiring iterative navigation, dynamic content loading, and structured form submission.
📝 Abstract
Information-seeking (IS) agents have achieved strong performance across a range of wide and deep search tasks, yet their tool use remains largely restricted to API-level snippet retrieval and URL-based page fetching, limiting access to the richer information available through real browsing. While full browser interaction could unlock deeper capabilities, the fine-grained control it requires and the verbose page content it returns introduce substantial complexity for ReAct-style function-calling agents. To bridge this gap, we propose Nested Browser-Use Learning (NestBrowse), which introduces a minimal and complete browser-action framework that decouples interaction control from page exploration through a nested structure. This design simplifies agentic reasoning while enabling effective deep-web information acquisition. Empirical results on challenging deep IS benchmarks demonstrate that NestBrowse offers clear benefits in practice. Further in-depth analyses underscore its efficiency and flexibility.
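The nested structure described in the abstract can be illustrated with a toy sketch. This is not the NestBrowse implementation: the in-memory `PAGES` site, the `Browser` class, the `explore_page` keyword filter, and the fixed click policy in `run_agent` are all hypothetical stand-ins (a real system would presumably use a live browser and a language model in both loops). The sketch only shows the separation of concerns: an outer loop that issues coarse browser actions, and an inner loop that reads the verbose page and returns just the goal-relevant part.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical toy "website": page id -> verbose text and labelled links.
PAGES: Dict[str, dict] = {
    "home": {
        "text": "Welcome. Navigation. Ads. See the reports section.",
        "links": {"reports": "reports"},
    },
    "reports": {
        "text": "Footer. Annual report 2023: revenue was 42 units. Cookie notice.",
        "links": {},
    },
}

NOTHING = "(nothing relevant)"


def explore_page(text: str, goal: str) -> str:
    """Inner loop (assumed): scan the page sentence by sentence and keep
    only sentences sharing a word with the goal. A keyword match stands in
    for model-based reading."""
    keywords = set(goal.lower().split())
    kept: List[str] = []
    for sentence in text.split(". "):
        words = set(sentence.lower().replace(":", " ").replace(".", " ").split())
        if keywords & words:
            kept.append(sentence.strip().rstrip("."))
    return ". ".join(kept) if kept else NOTHING


@dataclass
class Browser:
    """Outer-loop tool surface (assumed): a tiny action set whose results
    are already condensed by the inner loop."""
    current: str = "home"

    def act(self, action: str, arg: str, goal: str) -> dict:
        if action == "visit":
            self.current = arg
        elif action == "click":
            self.current = PAGES[self.current]["links"][arg]
        else:
            raise ValueError(f"unknown action: {action}")
        page = PAGES[self.current]
        # The outer loop never sees the raw page text, only the summary.
        return {
            "summary": explore_page(page["text"], goal),
            "links": sorted(page["links"]),
        }


def run_agent(goal: str, max_steps: int = 5) -> List[dict]:
    """Outer loop with a fixed policy standing in for agent reasoning:
    follow the first link until a page yields something relevant."""
    browser = Browser()
    obs = browser.act("visit", "home", goal)
    trace = [obs]
    for _ in range(max_steps):
        if obs["summary"] != NOTHING or not obs["links"]:
            break
        obs = browser.act("click", obs["links"][0], goal)
        trace.append(obs)
    return trace


if __name__ == "__main__":
    for step in run_agent("revenue report"):
        print(step)
```

In this sketch the outer loop's context grows by one short summary per action rather than by a full page, which is the practical point of keeping page exploration inside the tool call.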