🤖 AI Summary
To address the efficiency–effectiveness trade-off in item representation for large language model (LLM)-based recommender systems, this paper proposes the first image-driven multimodal LLM recommendation framework, replacing textual item descriptions with visual inputs as the primary modality. The approach integrates a CLIP-based vision encoder, cross-modal alignment, instruction tuning, and a lightweight token-mapping mechanism to sidestep inherent limitations of text-based methods, such as incomplete attribute extraction and sensitivity to textual noise. Extensive experiments show consistent gains across multiple benchmarks: average Recall@10 improves by 6.8%, inference is 37% faster, and token consumption drops by 52%, while robustness to input noise is markedly improved. The work establishes an image-centric paradigm for LLM-based recommendation and offers a scalable, semantically rich pathway toward efficient multimodal recommendation.
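The token-mapping idea described above can be sketched roughly as follows: a single image embedding is projected into a handful of "soft tokens" in the LLM's embedding space, standing in for a long textual description. All names, dimensions, and the random (rather than learned) projection here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

CLIP_DIM = 512      # dimensionality of the CLIP image embedding (assumed)
LLM_DIM = 4096      # LLM token-embedding dimensionality (e.g., a 7B model; assumed)
N_SOFT_TOKENS = 4   # a few soft tokens instead of dozens of description tokens

# In the real system this projection would be learned during training;
# a small random matrix is used here purely for illustration.
W = rng.standard_normal((CLIP_DIM, N_SOFT_TOKENS * LLM_DIM)) * 0.02

def image_to_soft_tokens(clip_embedding: np.ndarray) -> np.ndarray:
    """Project one CLIP image embedding into N_SOFT_TOKENS LLM token embeddings."""
    flat = clip_embedding @ W                      # shape: (N_SOFT_TOKENS * LLM_DIM,)
    return flat.reshape(N_SOFT_TOKENS, LLM_DIM)   # one row per soft token

item_image_embedding = rng.standard_normal(CLIP_DIM)
soft_tokens = image_to_soft_tokens(item_image_embedding)
print(soft_tokens.shape)  # (4, 4096)
```

In a framework like this, the resulting rows would be spliced into the LLM's input sequence in place of the item's description tokens, which is where the token savings come from.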
📝 Abstract
Large Language Models (LLMs) have recently emerged as a powerful backbone for recommender systems. Existing LLM-based recommender systems take one of two approaches to representing items in natural language: Attribute-based Representation or Description-based Representation. In this work, we aim to address the trade-off between efficiency and effectiveness that these two approaches face when representing the items consumed by users. Based on our observation that there is significant information overlap between the images and descriptions associated with items, we propose a novel method, Image is all you need for LLM-based Recommender system (I-LLMRec). Our main idea is to leverage images as an alternative to lengthy textual descriptions for representing items, reducing token usage while preserving the rich semantic information of item descriptions. Through extensive experiments, we demonstrate that I-LLMRec outperforms existing methods in both efficiency and effectiveness. A further appeal of I-LLMRec is its reduced sensitivity to noise in descriptions, leading to more robust recommendations.