🤖 AI Summary
Existing multimodal conversational recommendation datasets suffer from reliance on synthetic dialogues, neglect of user history, or absence of fine-grained feedback, which limits both model learning and evaluation. To address this, we introduce VOGUE, the first real-world, human-to-human multimodal conversational recommendation dataset in the fashion domain, comprising a shared visual product catalogue, rich user profiles and historical interactions, naturally occurring two-party dialogues, and post-conversation multi-dimensional ratings. We propose a bidirectional feedback mechanism and context-enriched vision–language alignment annotations, and our analyses uncover feature-group-driven recommendation patterns. We further establish a systematic evaluation framework targeting preference-inference capability. Experiments reveal that current multimodal large language models achieve near-human performance in overall preference alignment, yet lag significantly in rating-distribution calibration and in out-of-dialogue generalization, particularly for items never explicitly discussed.
📝 Abstract
Multimodal conversational recommendation has emerged as a promising paradigm for delivering personalized experiences through natural dialogue enriched by visual and contextual grounding. Yet, current multimodal conversational recommendation datasets remain limited: existing resources either simulate conversations, omit user history, or fail to collect sufficiently detailed feedback, all of which constrain the types of research and evaluation they support.
To address these gaps, we introduce VOGUE, a novel dataset of 60 human-human dialogues in realistic fashion shopping scenarios. Each dialogue is paired with a shared visual catalogue, item metadata, user fashion profiles and histories, and post-conversation ratings from both Seekers and Assistants. This design enables rigorous evaluation of conversational preference inference, including not only alignment between predicted and ground-truth preferences, but also calibration against full rating distributions and comparison with explicit and implicit user satisfaction signals.
Our initial analyses of VOGUE reveal distinctive dynamics of visually grounded dialogue. For example, recommenders frequently suggest several items at once in feature-based groups, creating distinct conversational phases bridged by Seeker critiques and refinements. Benchmarking multimodal large language models (MLLMs) against human recommenders shows that while MLLMs approach human-level alignment in aggregate, they exhibit systematic distributional errors in reproducing human ratings and struggle to generalize preference inference beyond explicitly discussed items. These findings establish VOGUE both as a unique resource for studying multimodal conversational systems and as a challenge dataset that lies beyond the current recommendation capabilities of top-tier multimodal foundation models such as GPT-4o-mini, GPT-5-mini, and Gemini-2.5-Flash.
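The abstract contrasts aggregate preference alignment with calibration against full rating distributions, but does not spell out the metrics. The sketch below is only an illustration of that distinction, not the paper's evaluation code: it assumes Spearman rank correlation for alignment and the Wasserstein distance for distribution calibration, and all rating values and variable names are hypothetical.

```python
# Illustrative sketch only: VOGUE's actual metrics are not specified in the
# abstract. Ratings below are hypothetical placeholder values.
import numpy as np
from scipy.stats import spearmanr, wasserstein_distance

# Hypothetical per-item ratings (1-5) for one dialogue:
# ground-truth Seeker ratings vs. a model's predicted ratings.
seeker_ratings = np.array([5, 4, 2, 3, 5, 1])
model_ratings = np.array([4, 4, 3, 3, 5, 2])

# (1) Preference alignment: rank correlation between predicted and
#     ground-truth ratings (higher = better agreement on item ordering).
alignment, _ = spearmanr(model_ratings, seeker_ratings)

# (2) Distribution calibration: distance between the full rating
#     distributions (lower = model better reproduces the human rating spread).
calibration_gap = wasserstein_distance(model_ratings, seeker_ratings)

print(f"alignment (Spearman rho): {alignment:.2f}")
print(f"calibration gap (Wasserstein): {calibration_gap:.2f}")
```

Under this reading, a model can score well on alignment (it ranks items in roughly the right order) while still showing a large calibration gap (its ratings are systematically compressed or shifted relative to human ratings), which is the kind of failure mode the abstract attributes to current MLLMs.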