🤖 AI Summary
This work addresses the gap in multimodal preference modeling for conversational recommendation, specifically the underutilization of visual capabilities in existing vision-language models (VLMs) for image-driven recommendation. To this end, we introduce the first multimodal conversational recommendation dataset grounded in user-uploaded images, covering books and music and supporting both title-generation and multiple-choice recommendation tasks. We propose chain-of-imagery prompting, a novel prompting paradigm that explicitly models the cross-modal mapping from visual affect to semantic content. In a comprehensive evaluation spanning VLMs, text-only baselines, and multimodal prompt engineering, with recommendations validated by community voting, our method achieves significant performance gains. Crucially, we find that the negligible performance gap between pure language models and VLMs stems from ineffective decoding of the visual signal, not from inherent model limitations. All code and data are publicly released.
📝 Abstract
We introduce a multimodal dataset in which users express preferences through images. These images span a broad spectrum of visual expression, from landscapes to artistic depictions. Users request recommendations for books or music that evoke feelings similar to those captured in the images, and recommendations are endorsed by the community through upvotes. The dataset supports two recommendation tasks: title generation and multiple-choice selection. Our experiments with large foundation models reveal their limitations on these tasks. In particular, vision-language models show no significant advantage over language-only counterparts that use textual image descriptions, which we hypothesize is due to underutilized visual capabilities. To better harness these capabilities, we propose chain-of-imagery prompting, which yields notable improvements. We release our code and datasets.
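The abstract describes chain-of-imagery prompting only at a high level. One plausible realization is a two-stage prompt: first elicit the image's imagery and affect explicitly, then condition the recommendation on that decoded description. The sketch below illustrates this structure only; the `model` interface and the prompt wording are assumptions for illustration, not the paper's actual prompts.

```python
def chain_of_imagery_prompt(model, image, request):
    """Two-stage prompting sketch (hypothetical interface).

    `model` is any callable (prompt: str, image) -> str, standing in for
    a VLM query; `image` is whatever the model accepts (e.g. a path).
    """
    # Stage 1: decode the visual signal into explicit imagery and affect,
    # rather than relying on the VLM to use the image implicitly.
    imagery = model(
        "Describe the mood, imagery, and feelings this image evokes.",
        image,
    )
    # Stage 2: map the decoded affect to semantic content (a title),
    # grounding the recommendation in the stage-1 description.
    return model(
        f"The image evokes: {imagery}\n"
        f"User request: {request}\n"
        "Recommend a book or music title that evokes similar feelings.",
        image,
    )
```

The key design choice is that the visual-to-affect mapping is made an explicit intermediate output, which is consistent with the paper's finding that VLMs underperform when the visual signal is left implicitly decoded.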