🤖 AI Summary
Existing image captioning models predominantly rely on supervised fine-tuning (SFT), which suffers from high annotation costs, the limited scalability of human- or proprietary-labeled data, overfitting to canonical references, and insufficient generalization and descriptive diversity. To address these limitations, we propose CapRL, the first method to bring verifiable reward-based reinforcement learning to the subjective task of image captioning. CapRL adopts a decoupled two-stage framework: a large vision-language model generates captions, while a separate vision-free language model evaluates them by answering multiple-choice questions based solely on each caption, yielding an objective, computationally tractable reward signal. Pretraining on the CapRL-5M caption dataset, annotated by CapRL-3B, delivers substantial gains across 12 benchmarks; under the Prism caption-evaluation framework, CapRL achieves performance on par with Qwen2.5-VL-72B while exceeding the baseline by an average of 8.4%, demonstrating superior accuracy and descriptive diversity.
📝 Abstract
Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and their ability to generate diverse, creative descriptions. To overcome these limitations of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for a task where what constitutes a "good" caption is inherently subjective. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL delivers significant improvements across multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks. Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Code is available here: https://github.com/InternLM/CapRL.
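To make the decoupled reward concrete, here is a minimal Python sketch of how an MCQ-accuracy reward of this kind could be computed. The `MCQ` structure, the prompt format, and the `answer_with_llm` callable are illustrative assumptions for exposition, not the actual implementation in https://github.com/InternLM/CapRL.

```python
# Minimal sketch of a caption-utility reward in the spirit of CapRL
# (hypothetical helper names; the real pipeline may differ).
from dataclasses import dataclass


@dataclass
class MCQ:
    question: str
    options: list[str]  # e.g. ["A) a dog", "B) a cat", ...]
    answer: str         # ground-truth option letter, e.g. "B"


def caprl_reward(caption: str, mcqs: list[MCQ], answer_with_llm) -> float:
    """Reward for one generated caption: the fraction of multiple-choice
    questions a vision-free LLM answers correctly from the caption alone."""
    correct = 0
    for q in mcqs:
        # The LLM sees only the caption text, never the image.
        prompt = (
            "Answer using only the image description below.\n"
            f"Description: {caption}\n"
            f"Question: {q.question}\n"
            + "\n".join(q.options)
            + "\nReply with the option letter only."
        )
        prediction = answer_with_llm(prompt)  # text-only LLM call (assumed interface)
        if prediction.strip().upper().startswith(q.answer.upper()):
            correct += 1
    # Accuracy is an objective, verifiable scalar reward in [0, 1].
    return correct / len(mcqs)
```

Because the reward is a simple answer-accuracy check rather than a learned preference score, it stays objective and cheap to verify, which is what makes the RLVR paradigm applicable to an otherwise subjective task.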