Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the limited cognitive reasoning capability of web agents by proposing the Web-CogKnowledge Framework, which splits an agent's capabilities into two stages and categorizes knowledge as Factual, Conceptual, and Procedural. The first stage, knowledge content learning, covers Memorizing and Understanding and relies on Factual and Conceptual knowledge. The second stage, cognitive processes, covers Exploring and is grounded in Procedural knowledge. To support knowledge acquisition, the authors construct Web-CogDataset, a structured resource curated from 14 real-world websites; to support evaluation, they introduce Web-CogBench, a comprehensive evaluation suite. The resulting agent, Web-CogReasoner, is trained with a knowledge-driven chain-of-thought reasoning framework. The authors report that it significantly outperforms existing models, especially when generalizing to unseen tasks where structured knowledge is decisive.

📝 Abstract
Multimodal large-scale models have significantly advanced the development of web agents, enabling perception and interaction with digital environments akin to human cognition. In this paper, we argue that web agents must first acquire sufficient knowledge to effectively engage in cognitive reasoning. Therefore, we decompose a web agent's capabilities into two essential stages: knowledge content learning and cognitive processes. To formalize this, we propose the Web-CogKnowledge Framework, which categorizes knowledge as Factual, Conceptual, and Procedural. In this framework, knowledge content learning corresponds to the agent's processes of Memorizing and Understanding, which rely on the first two knowledge types, representing the "what" of learning. Conversely, cognitive processes correspond to Exploring, grounded in Procedural knowledge, defining the "how" of reasoning and action. To facilitate knowledge acquisition, we construct the Web-CogDataset, a structured resource curated from 14 real-world websites, designed to systematically instill the core knowledge necessary for a web agent. This dataset serves as the agent's conceptual grounding (the "nouns" upon which comprehension is built) as well as the basis for learning how to reason and act. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, developing and training our proposed agent, the Web-CogReasoner. Extensive experimentation reveals its significant superiority over existing models, especially in generalizing to unseen tasks where structured knowledge is decisive. To enable rigorous evaluation, we introduce the Web-CogBench, a comprehensive evaluation suite designed to assess and compare agent performance across the delineated knowledge domains and cognitive capabilities. Our code and data are open-sourced at https://github.com/Gnonymous/Web-CogReasoner
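
The two-stage decomposition described in the abstract can be written down as a small lookup table. The Python sketch below is illustrative only: the `Stage` class, the `STAGES` tuple, and the `stage_for_knowledge` helper are hypothetical names and are not taken from the Web-CogReasoner code base. Only the stage names, process names, knowledge types, and the "what"/"how" roles come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One stage of the framework as described in the abstract (hypothetical class)."""
    name: str
    processes: tuple[str, ...]        # agent processes belonging to this stage
    knowledge_types: tuple[str, ...]  # knowledge types the stage relies on
    role: str                         # "what" of learning or "how" of reasoning/action


# Pairings taken from the abstract; the data layout itself is an assumption.
STAGES = (
    Stage(
        name="knowledge content learning",
        processes=("Memorizing", "Understanding"),
        knowledge_types=("Factual", "Conceptual"),
        role="what",
    ),
    Stage(
        name="cognitive processes",
        processes=("Exploring",),
        knowledge_types=("Procedural",),
        role="how",
    ),
)


def stage_for_knowledge(knowledge_type: str) -> Stage:
    """Return the stage that relies on the given knowledge type."""
    for stage in STAGES:
        if knowledge_type in stage.knowledge_types:
            return stage
    raise ValueError(f"unknown knowledge type: {knowledge_type!r}")
```

For example, `stage_for_knowledge("Procedural")` returns the "cognitive processes" stage, whose only process is Exploring.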

Problem

Research questions and friction points this paper is trying to address.

Enhancing web agents with cognitive reasoning through knowledge acquisition
Structuring knowledge into Factual, Conceptual, and Procedural types for learning
Developing a knowledge-driven Chain-of-Thought framework for web agent reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Web-CogKnowledge Framework categorizes knowledge as Factual, Conceptual, and Procedural
Web-CogDataset provides structured knowledge from 14 real-world websites
Knowledge-driven Chain-of-Thought reasoning enhances cognitive processes
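
This page does not say how the knowledge-driven Chain-of-Thought is formatted. As a rough sketch under that caveat, one plausible way to arrange a prompt so that reasoning passes through Factual, Conceptual, and Procedural knowledge before an action is chosen is shown below. The function name, the wording of each step, and the example inputs are all hypothetical; this is not the authors' prompt template.

```python
def build_knowledge_cot_prompt(task: str, page_observation: str) -> str:
    """Assemble a three-step prompt ordered by knowledge type (illustrative only)."""
    # Step instructions are assumptions made for this sketch, not quotes from the paper.
    sections = [
        ("Factual", "List the elements visible on the page and their attributes."),
        ("Conceptual", "Explain what those elements mean and how they relate to the task."),
        ("Procedural", "Plan the next action and state it explicitly."),
    ]
    lines = [f"Task: {task}", f"Observation: {page_observation}", ""]
    for index, (knowledge_type, instruction) in enumerate(sections, start=1):
        lines.append(f"Step {index} ({knowledge_type} knowledge): {instruction}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical task and observation, used only to show the output shape.
    print(build_knowledge_cot_prompt("Find the return policy", "Home page with a footer menu"))
```

The ordering mirrors the framework's claim that the "what" (Factual and Conceptual knowledge) should be established before the "how" (Procedural knowledge) drives action.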

🔎 Similar Papers
2023-08-22 | Frontiers Comput. Sci. | Citations: 866
Yuhan Guo, Southwestern University of Finance and Economics
Cong Guo, Southwestern University of Finance and Economics
Aiwen Sun, Central South University
Hongliang He, Westlake University
Xinyu Yang, Hithink Research
Yue Lu, Hithink Research
Yingji Zhang, University of Manchester (Computational Linguistics, Representation Learning, Disentanglement, Multi-modal Learning)
Xuntao Guo, Harbin Institute of Technology
Dong Zhang, Hithink Research
Jianzhuang Liu, Shenzhen Institutes of Advanced Technology, University of Chinese Academy of Sciences (Computer Vision, Image Processing, AIGC, Machine Learning)
Jiang Duan, Southwestern University of Finance and Economics
Yijia Xiao, University of California, Los Angeles (AI for Finance, Agents, AI for Science, Multimodal LLM)
Liangjian Wen, Southwestern University of Finance and Economics, Chengdu, China
Hai-Ming Xu, TikTok (Machine Learning, Computer Vision)
Yong Dai, Fudan University