🤖 AI Summary
Unsupervised image captioning lacks effective frameworks that enable vision-language models (VLMs) to acquire language capabilities without labeled data or external supervision.
Method: This paper proposes LoGIC, the first multi-agent reinforcement learning framework for unsupervised image captioning, grounded in the Lewis signaling game. It introduces a collaborative “speaker–listener” architecture: the speaker generates captions, while the listener interprets them and provides feedback, jointly optimizing on raw, unlabeled images. Two configurations are studied: a pretrained VLM as the speaker with an LLM for language understanding in the listener, and a lightweight speaker built from a ViT (image perception) and GPT-2 (language generation) trained from scratch. Both are trained end-to-end via the GRPO algorithm.
Contribution/Results: Fine-tuning a pretrained VLM yields a BLEU score of 46.0, surpassing the vanilla VLM baseline (44.0) by 2.0 absolute points. The lightweight model trained from scratch achieves a BLEU score of 31.0, outperforming prior unsupervised methods by 10.0 points. These results empirically validate that communication games can elicit emergent linguistic competence in VLMs, and the gains hold across model sizes.
📝 Abstract
Image captioning is an important problem in developing various AI systems, and these tasks require large volumes of annotated images to train the models. Since all existing labelled datasets are already used for training large Vision-Language Models (VLMs), it becomes challenging to further improve their performance. Considering this, unsupervised image captioning, which remains relatively under-explored, becomes essential. To that end, we propose LoGIC (Lewis Communication Game for Image Captioning), a multi-agent reinforcement learning game. The proposed method consists of two agents, a 'speaker' and a 'listener', with the objective of learning a strategy for communicating in natural language. We train the agents in the cooperative common-reward setting using the GRPO algorithm and show that improvements in image captioning performance emerge as a consequence of the agents learning to play the game. Using a pre-trained VLM as the 'speaker' and a Large Language Model (LLM) for language understanding in the 'listener', we achieve a $46$ BLEU score after fine-tuning with LoGIC without additional labels, a $2$-unit advantage in absolute metrics over the $44$ BLEU score of the vanilla VLM. Additionally, we replace the VLM in the 'speaker' with lightweight components: (i) a ViT for image perception and (ii) a GPT-2 for language generation, and train them from scratch using LoGIC, obtaining a $31$ BLEU score in the unsupervised setting, a $10$-point advantage over existing unsupervised image-captioning methods.
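The speaker–listener game with group-relative rewards can be sketched minimally. This is an illustrative toy, not the paper's implementation: the `speaker` and `listener` callables, the reward signal (the listener picking the target image out of distractors), and the group size are all assumptions introduced here for illustration; only the group-normalized advantage follows the standard GRPO recipe of standardizing rewards within a group of samples drawn for the same input.

```python
def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: standardize rewards
    within one group of samples drawn for the same input."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # fall back to 1.0 when all rewards are equal
    return [(r - mean) / std for r in rewards]

def play_round(image, speaker, listener, distractors, group_size=4):
    """One round of the cooperative game (hypothetical setup):
    the speaker samples a group of candidate captions for one image;
    the listener must identify that image among distractors given each
    caption; the shared reward is 1.0 on a correct identification."""
    captions = [speaker(image) for _ in range(group_size)]
    candidates = [image] + distractors
    rewards = [1.0 if listener(c, candidates) == image else 0.0
               for c in captions]
    # Captions paired with advantages would drive the policy update.
    return captions, grpo_advantages(rewards)

# Toy demo with stub agents (no real models):
speaker = lambda img: f"a caption for {img}"
listener = lambda caption, candidates: candidates[0]  # always "correct"
captions, advs = play_round("img_0", speaker, listener, ["img_1", "img_2"])
```

Because advantages are centered within the group, captions that help the listener more than the group average get a positive learning signal even though no ground-truth captions exist, which is how the common reward substitutes for labels.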