VOGUE: A Multimodal Dataset for Conversational Recommendation in Fashion

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal conversational recommendation datasets suffer from reliance on synthetic dialogues, neglect of user history, or absence of fine-grained feedback—limiting both model learning and evaluation. To address this, we introduce VOGUE, the first real-world, human-to-human multimodal conversational recommendation dataset in the fashion domain, comprising a visual product catalog, rich user profiles and historical interactions, naturally occurring two-party dialogues, and post-conversation multi-dimensional ratings. We propose a novel bidirectional feedback mechanism and context-enriched vision–language alignment annotations, uncovering feature-group-driven recommendation patterns. Furthermore, we establish a systematic evaluation framework targeting preference inference capability. Experiments reveal that current multimodal large models achieve near-human performance in overall preference alignment, yet significantly lag in rating distribution calibration and out-of-dialogue generalization—particularly for unmentioned items.

📝 Abstract
Multimodal conversational recommendation has emerged as a promising paradigm for delivering personalized experiences through natural dialogue enriched by visual and contextual grounding. Yet, current multimodal conversational recommendation datasets remain limited: existing resources either simulate conversations, omit user history, or fail to collect sufficiently detailed feedback, all of which constrain the types of research and evaluation they support. To address these gaps, we introduce VOGUE, a novel dataset of 60 human-human dialogues in realistic fashion shopping scenarios. Each dialogue is paired with a shared visual catalogue, item metadata, user fashion profiles and histories, and post-conversation ratings from both Seekers and Assistants. This design enables rigorous evaluation of conversational inference, including not only alignment between predicted and ground-truth preferences, but also calibration against full rating distributions and comparison with explicit and implicit user satisfaction signals. Our initial analyses of VOGUE reveal distinctive dynamics of visually grounded dialogue. For example, recommenders frequently suggest items simultaneously in feature-based groups, which creates distinct conversational phases bridged by Seeker critiques and refinements. Benchmarking multimodal large language models against human recommenders shows that while MLLMs approach human-level alignment in aggregate, they exhibit systematic distribution errors in reproducing human ratings and struggle to generalize preference inference beyond explicitly discussed items. These findings establish VOGUE both as a unique resource for studying multimodal conversational systems and as a challenge dataset beyond the current recommendation capabilities of existing top-tier multimodal foundation models such as GPT-4o-mini, GPT-5-mini, and Gemini-2.5-Flash.
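The abstract distinguishes two evaluation axes: aggregate alignment between predicted and ground-truth ratings, and calibration against the full rating distribution. A minimal sketch of how these two measures differ is below; the function names, the 5-point rating scale, and the choice of mean absolute error plus a discrete 1-Wasserstein (earth mover's) distance are illustrative assumptions, not the metrics defined in the paper.

```python
from collections import Counter

SCALE = (1, 2, 3, 4, 5)  # assumed 5-point rating scale

def alignment_mae(pred, true):
    """Aggregate alignment: mean absolute error between predicted
    and ground-truth ratings (lower is better)."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def rating_histogram(ratings, scale=SCALE):
    """Normalized frequency of each rating value."""
    counts = Counter(ratings)
    return [counts.get(r, 0) / len(ratings) for r in scale]

def calibration_emd(pred, true, scale=SCALE):
    """Distribution calibration: 1-Wasserstein distance between the
    discrete rating histograms (0 = identical distributions)."""
    p, q = rating_histogram(pred, scale), rating_histogram(true, scale)
    emd = cum = 0.0
    for pi, qi in zip(p, q):
        cum += pi - qi   # cumulative mass surplus/deficit
        emd += abs(cum)  # cost of moving that surplus one bucket over
    return emd
```

A model can score well on the first metric while failing the second, e.g. by predicting the mean rating everywhere: per-item error stays low, but the predicted histogram collapses to a single bucket, which the distance penalizes.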
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations in multimodal conversational recommendation datasets for fashion
Enabling rigorous evaluation of conversational inference and user preference alignment
Studying visually grounded dialogue dynamics and multimodal recommendation capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced VOGUE, a dataset of 60 real human-human fashion shopping dialogues
Paired each dialogue with a shared visual catalogue, item metadata, user profiles, and interaction histories
Enabled evaluation of multimodal conversational recommendation, including rating-distribution calibration
David Guo
University of Toronto
Minqi Sun
University of Waterloo
Yilun Jiang
University of Waterloo
Jiazhou Liang
University of Toronto
Scott Sanner
University of Toronto
Artificial Intelligence · Machine Learning · Information Retrieval