🤖 AI Summary
This work addresses the gap in multimodal preference modeling for conversational recommendation, specifically the underutilization of visual capabilities in existing vision-language models (VLMs) for image-driven recommendation. To this end, we introduce the first multimodal conversational recommendation dataset grounded in user-uploaded images, covering books and music and supporting both title-generation and multiple-choice recommendation tasks. We propose chain-of-imagery prompting, a novel prompting paradigm that explicitly models the cross-modal mapping from visual affect to semantic content. In a comprehensive evaluation spanning VLMs, text-only baselines, and multimodal prompt engineering, with recommendations validated by community voting, our method achieves significant performance gains. Crucially, we find that the negligible performance gap between pure language models and VLMs stems from ineffective decoding of the visual signal, not from inherent model limitations. All code and data are publicly released.
📝 Abstract
We introduce a multimodal dataset in which users express preferences through images. These images span a broad spectrum of visual expression, from landscapes to artistic depictions. Users request recommendations for books or music that evoke feelings similar to those captured in the images, and recommendations are endorsed by the community through upvotes. The dataset supports two recommendation tasks: title generation and multiple-choice selection. Our experiments with large foundation models reveal their limitations on these tasks. In particular, vision-language models show no significant advantage over language-only counterparts that use textual image descriptions, which we hypothesize is due to underutilized visual capabilities. To better harness these capabilities, we propose chain-of-imagery prompting, which yields notable improvements. We release our code and datasets.
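The abstract describes chain-of-imagery prompting only at a high level. One plausible realization is a two-stage prompt: first elicit the image's imagery and affect explicitly, then condition the recommendation on that decoded description. The sketch below illustrates this structure only; the `model` interface and the prompt wording are assumptions for illustration, not the paper's actual prompts.

```python
def chain_of_imagery_prompt(model, image, request):
    """Two-stage prompting sketch (hypothetical interface).

    `model` is any callable (prompt: str, image) -> str, standing in for
    a VLM query; `image` is whatever the model accepts (e.g. a path).
    """
    # Stage 1: decode the visual signal into explicit imagery and affect,
    # rather than relying on the VLM to use the image implicitly.
    imagery = model(
        "Describe the mood, imagery, and feelings this image evokes.",
        image,
    )
    # Stage 2: map the decoded affect to semantic content (a title),
    # grounding the recommendation in the stage-1 description.
    return model(
        f"The image evokes: {imagery}\n"
        f"User request: {request}\n"
        "Recommend a book or music title that evokes similar feelings.",
        image,
    )
```

The key design choice is that the visual-to-affect mapping is made an explicit intermediate output, which is consistent with the paper's finding that VLMs underperform when the visual signal is left implicitly decoded.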