🤖 AI Summary
Unsupervised image captioning lacks effective frameworks that enable vision-language models (VLMs) to acquire language capabilities without labeled data or external supervision.
Method: This paper proposes LoGIC, the first multi-agent reinforcement learning framework for unsupervised image captioning, grounded in the Lewis signaling game. It introduces a collaborative “speaker–listener” architecture: the speaker generates captions, while the listener interprets them and provides feedback, jointly optimizing on raw, unlabeled images. Two configurations are studied: a pretrained VLM as the speaker with an LLM for language understanding in the listener, and a lightweight speaker built from a ViT (image perception) and GPT-2 (language generation) trained from scratch. Both are trained end-to-end via the GRPO algorithm.
Contribution/Results: Fine-tuning a pretrained VLM yields a BLEU score of 46.0, surpassing the vanilla VLM baseline (44.0) by 2.0 absolute points. The lightweight model trained from scratch achieves a BLEU score of 31.0, outperforming prior unsupervised methods by 10.0 points. These results empirically validate that communication games can elicit emergent linguistic competence in VLMs, and the gains hold across model sizes.
📝 Abstract
Image captioning is an important problem in developing various AI systems, and these tasks require large volumes of annotated images to train the models. Since all existing labelled datasets are already used for training large Vision-Language Models (VLMs), it becomes challenging to further improve their performance. Considering this, unsupervised image captioning, which remains relatively under-explored, becomes essential. To that end, we propose LoGIC (Lewis Communication Game for Image Captioning), a multi-agent reinforcement learning game. The proposed method consists of two agents, a 'speaker' and a 'listener', with the objective of learning a strategy for communicating in natural language. We train the agents in the cooperative common-reward setting using the GRPO algorithm and show that improvements in image captioning performance emerge as a consequence of the agents learning to play the game. Using a pre-trained VLM as the 'speaker' and a Large Language Model (LLM) for language understanding in the 'listener', we achieve a $46$ BLEU score after fine-tuning with LoGIC without additional labels, a $2$-unit advantage in absolute metrics over the $44$ BLEU score of the vanilla VLM. Additionally, we replace the VLM in the 'speaker' with lightweight components: (i) a ViT for image perception and (ii) a GPT-2 for language generation, and train them from scratch using LoGIC, obtaining a $31$ BLEU score in the unsupervised setting, a $10$-point advantage over existing unsupervised image-captioning methods.
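The speaker–listener game with group-relative rewards can be sketched minimally. This is an illustrative toy, not the paper's implementation: the `speaker` and `listener` callables, the reward signal (the listener picking the target image out of distractors), and the group size are all assumptions introduced here for illustration; only the group-normalized advantage follows the standard GRPO recipe of standardizing rewards within a group of samples drawn for the same input.

```python
def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: standardize rewards
    within one group of samples drawn for the same input."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # fall back to 1.0 when all rewards are equal
    return [(r - mean) / std for r in rewards]

def play_round(image, speaker, listener, distractors, group_size=4):
    """One round of the cooperative game (hypothetical setup):
    the speaker samples a group of candidate captions for one image;
    the listener must identify that image among distractors given each
    caption; the shared reward is 1.0 on a correct identification."""
    captions = [speaker(image) for _ in range(group_size)]
    candidates = [image] + distractors
    rewards = [1.0 if listener(c, candidates) == image else 0.0
               for c in captions]
    # Captions paired with advantages would drive the policy update.
    return captions, grpo_advantages(rewards)

# Toy demo with stub agents (no real models):
speaker = lambda img: f"a caption for {img}"
listener = lambda caption, candidates: candidates[0]  # always "correct"
captions, advs = play_round("img_0", speaker, listener, ["img_1", "img_2"])
```

Because advantages are centered within the group, captions that help the listener more than the group average get a positive learning signal even though no ground-truth captions exist, which is how the common reward substitutes for labels.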