MMWOZ: Building Multimodal Agent for Task-oriented Dialogue

📅 2025-11-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional task-oriented dialogue systems rely on customized backend APIs, limiting their applicability to GUI-based frontends lacking such interfaces—a key practical bottleneck. This paper introduces the first multimodal task-oriented dialogue framework designed for GUI interaction, enabling goal completion directly from UI screenshots and natural language instructions—without requiring backend APIs. Our contributions are threefold: (1) We present MMWOZ, the first large-scale, GUI-action-oriented multimodal dialogue dataset; (2) We propose MATE, a baseline model that jointly encodes visual and textual inputs to perform end-to-end state tracking and action prediction; (3) We design an automated method for generating executable GUI operation instructions. Experiments demonstrate substantial improvements in task completion rate and fidelity to real-world UI interactions, establishing a scalable technical pathway toward API-free intelligent agents.

📝 Abstract
Task-oriented dialogue systems have garnered significant attention due to their conversational ability to accomplish goals, such as booking airline tickets for users. Traditionally, task-oriented dialogue systems are conceptualized as intelligent agents that interact with users using natural language and have access to customized back-end APIs. However, in real-world scenarios, the widespread presence of front-end Graphical User Interfaces (GUIs) and the absence of customized back-end APIs create a significant gap for traditional task-oriented dialogue systems in practical applications. In this paper, to bridge the gap, we collect MMWOZ, a new multimodal dialogue dataset that is extended from MultiWOZ 2.3 dataset. Specifically, we begin by developing a web-style GUI to serve as the front-end. Next, we devise an automated script to convert the dialogue states and system actions from the original dataset into operation instructions for the GUI. Lastly, we collect snapshots of the web pages along with their corresponding operation instructions. In addition, we propose a novel multimodal model called MATE (Multimodal Agent for Task-oriEnted dialogue) as the baseline model for the MMWOZ dataset. Furthermore, we conduct comprehensive experimental analysis using MATE to investigate the construction of a practical multimodal agent for task-oriented dialogue.
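The abstract's conversion step (mapping annotated dialogue states and system actions to GUI operation instructions) can be sketched as follows. This is a hypothetical illustration: the slot names, widget identifiers, and `(action, target, value)` instruction format are assumptions for exposition, not the authors' actual schema.

```python
# Hypothetical sketch of the automated conversion script described in the
# abstract: map annotated dialogue-state slots to GUI operation
# instructions. Widget ids and the instruction tuple format are invented
# for illustration.

def state_to_instructions(dialogue_state):
    """Map {domain: {slot: value}} annotations to a list of
    (action, target_widget, value) GUI operations."""
    instructions = []
    for domain, slots in dialogue_state.items():
        # Navigate to the page for the current domain (e.g. a "hotel" tab).
        instructions.append(("click", f"{domain}_tab", None))
        for slot, value in slots.items():
            # Fill each informed slot into the matching form field.
            instructions.append(("type", f"{domain}_{slot}_input", value))
        # Submit the form, so the search a back-end API would have
        # performed is instead triggered through the front-end GUI.
        instructions.append(("click", f"{domain}_search_button", None))
    return instructions

state = {"hotel": {"area": "centre", "stars": "4"}}
ops = state_to_instructions(state)
```

Pairing each such instruction sequence with a snapshot of the rendered page would yield training examples of the kind the MMWOZ dataset collects.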
Problem

Research questions and friction points this paper is trying to address.

Bridging the gap between traditional task-oriented dialogue systems and GUI-based front-ends
Creating a multimodal dataset that pairs web-page snapshots with corresponding operation instructions
Developing practical agents that complete tasks by interacting with graphical interfaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed a web-style GUI to serve as the front-end interface
Automated the conversion of dialogue states and system actions into GUI operation instructions
Proposed MATE, a multimodal baseline model for task-oriented dialogue
Pu-Hai Yang
School of Artificial Intelligence, Anhui University, Hefei, China
Heyan Huang
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
Heng-Da Xu
Ph.D. Student, Beijing Institute of Technology
NLP, dialogue systems
Fanshu Sun
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
Xian-Ling Mao
Beijing Institute of Technology
Web Data Mining, Information Extraction, QA & Dialogue, Topic Modeling, Learning to Hash
Chaoxu Mu
Tianjin University
Nonlinear system control and optimization, Adaptive and learning systems, Smart grid