AI Summary
Current multimodal large language models (MLLMs) struggle to generate faithful, personalized descriptions of complex, multi-concept images, largely because high-quality annotated data is scarce and costly to acquire. To address this, we propose RL-PostTune, the first reinforcement learning (RL)-based post-training framework for personalized image captioning that requires no large-scale human annotation. Our method jointly optimizes the visual perception and language generation policies, using fine-grained visual feedback as the reward signal that guides caption generation. Experiments show substantial gains in both multi-concept scene understanding and personalized expression, with consistent improvements over supervised fine-tuning baselines across multiple benchmarks. Our contributions are threefold: (1) the first application of RL to personalized MLLM post-training; (2) removal of the data-dependency bottleneck inherent to supervised fine-tuning; and (3) empirical validation of vision-guided reward mechanisms for improving multimodal generative fidelity.
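The summary describes jointly optimizing a generation policy against fine-grained visual feedback. A minimal REINFORCE-style sketch of that idea is below, assuming a toy unigram caption policy over a tiny vocabulary and a hypothetical concept-coverage reward (the paper's actual reward model, policy architecture, and optimizer are not specified here):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def concept_reward(caption_tokens, target_concepts):
    """Hypothetical reward: fraction of the image's target concepts that
    appear in the sampled caption. This stands in for fine-grained visual
    feedback; the real reward signal is not detailed in the summary."""
    hits = sum(1 for c in target_concepts if c in caption_tokens)
    return hits / len(target_concepts)

def reinforce_step(logits, vocab, target_concepts, rng, lr=0.5,
                   caption_len=3, n_samples=64):
    """One REINFORCE update of a toy unigram caption policy."""
    probs = softmax(logits)
    samples = []
    for _ in range(n_samples):
        # Sample a short "caption" i.i.d. from the token policy.
        idx = rng.choice(len(vocab), size=caption_len, p=probs)
        caption = [vocab[i] for i in idx]
        samples.append((idx, concept_reward(caption, target_concepts)))
    # Mean-reward baseline for variance reduction.
    baseline = sum(r for _, r in samples) / n_samples
    grad = np.zeros_like(logits)
    for idx, r in samples:
        adv = r - baseline
        for i in idx:
            one_hot = np.zeros_like(logits)
            one_hot[i] = 1.0
            grad += adv * (one_hot - probs)  # grad of log pi(token i)
    return logits + lr * grad / n_samples
```

Repeated calls shift probability mass toward tokens that earn reward, which is the core mechanism: the reward, not annotated captions, supervises generation.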
Abstract
Recent multimodal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that this limitation persists in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned on large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios such as multi-concept image captioning. Acquiring large-scale, high-quality captions for such complex settings, however, is both costly and difficult. To overcome the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both the visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially on the challenging multi-concept image captioning task.