RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models

📅 2024-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address two limitations of multimodal large language models (MLLMs), their lack of user-specific knowledge and their resulting poor fit to everyday scenarios, this paper proposes the Retrieval-Augmented Personalization (RAP) framework. RAP maintains a key-value user memory bank, pairs it with a multimodal retriever that supports real-time concept editing, and injects the retrieved knowledge into the MLLM's generation process, enabling personalization without per-user fine-tuning. Key contributions include: (1) the first editable external memory mechanism for MLLMs; (2) a personalized instruction-tuning dataset and the pipeline used to collect it; and (3) large-scale pretraining that lets RAP-MLLMs generalize to an unbounded set of visual concepts. Extensive experiments demonstrate significant improvements in generation quality and consistency across personalized image captioning, visual question answering, and visual recognition. The code, datasets, and a series of RAP-MLLM models are publicly released.

📝 Abstract
The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, a lack of user-specific knowledge still restricts their application in daily life. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for MLLMs' personalization. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar and other attributes. (b) Retrieve: When the user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and the retrieved concepts' information are fed into the MLLM to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing by updating the external database. To further improve generation quality and alignment with user-specific information, we design a data-collection pipeline and create a specialized dataset for personalized training of MLLMs. Based on this dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to an unbounded set of visual concepts without additional fine-tuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering and visual recognition. The code, data and models are available at https://hoar012.github.io/RAP-Project/.
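The three-step loop in the abstract (Remember, Retrieve, Generate) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy bag-of-characters embedding stands in for the multimodal retriever, and `generate` merely prepends retrieved context to the query instead of calling an MLLM; all names here (`MemoryBank`, `embed`, `generate`) are hypothetical.

```python
import math

def embed(text):
    """Toy bag-of-characters embedding; a stand-in for a multimodal encoder."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class MemoryBank:
    """Key-value store mapping a concept name to its description (Remember)."""
    def __init__(self):
        self.records = {}

    def remember(self, name, info):
        self.records[name] = info

    def retrieve(self, query, k=1):
        """Return the k stored concepts most similar to the query (Retrieve).
        Embeddings are recomputed per query; a real system would cache them."""
        q = embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, embed(name + " " + info))), name, info)
            for name, info in self.records.items()
        ]
        scored.sort(reverse=True)
        return [(name, info) for _, name, info in scored[:k]]

def generate(query, bank):
    """Prepend retrieved concept info to the prompt (Generate).
    A real system would feed this augmented prompt to the MLLM."""
    context = "; ".join(f"{n}: {i}" for n, i in bank.retrieve(query))
    return f"[context: {context}] {query}"
```

For example, after `bank.remember("Bella", "the user's golden retriever")`, a query like `generate("Who is Bella?", bank)` yields a prompt carrying Bella's stored description, so the downstream model can answer with user-specific knowledge.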
Problem

Research questions and friction points this paper is trying to address.

Enhance MLLMs with user-specific knowledge for daily life applications
Enable real-time personalization via retrieval-augmented database updates
Improve multimodal task performance with personalized training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Key-value memory bank stores user-specific concept information
Multimodal retriever fetches relevant entries in real time
MLLM generates responses conditioned on the retrieved concepts
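A distinguishing point above is that personalization lives entirely in the external database, so concepts can be edited in real time without retraining. A hedged sketch of such editing operations, with hypothetical names (`EditableMemory`, `add`, `update`, `forget`) that do not come from the paper:

```python
class EditableMemory:
    """External key-value store; edits take effect immediately,
    with no model fine-tuning involved."""
    def __init__(self):
        self.db = {}

    def add(self, name, info):
        self.db[name] = info        # new concept is retrievable at once

    def update(self, name, info):
        self.db[name] = info        # edit overwrites the stored value

    def forget(self, name):
        self.db.pop(name, None)     # removed concept can no longer be retrieved
```

Because the MLLM only sees whatever the retriever returns at query time, updating or deleting an entry changes the model's personalized behavior on the very next query.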