🤖 AI Summary
To address inconsistent image descriptions and semantic incoherence in embodied agents operating within dynamic, cluttered environments, where the same object is seen from multiple viewpoints, this paper proposes a three-stage self-supervised framework: (1) exploration-driven acquisition of multi-view noisy images; (2) large language model (LLM)-guided cross-view pseudo-label consensus distillation; and (3) end-to-end fine-tuning of vision-language models with integrated contrastive learning. The method requires no human annotations, leveraging autonomous exploration, multi-view consistency modeling, and semantic alignment. On a human-annotated test set, its pseudo-labels achieve higher semantic similarity than state-of-the-art approaches, and fine-tuning yields substantial gains in caption accuracy and inter-view description consistency. Furthermore, the exploration strategy actively identifies high-disagreement samples, enhancing training robustness.
📝 Abstract
We present a self-supervised method to improve an agent's abilities in describing arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to obtain coherent image captions due to different camera viewpoints and clutter. We propose a three-phase framework to fine-tune existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse the performance of combinations of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement compared to classical baselines. Our pseudo-captioning method, in combination with all policies, achieves higher semantic similarity than other existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin. Code and test set annotations are available at https://hsp-iit.github.io/embodied-captioning/
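To make the consensus step concrete, here is a minimal sketch of distilling one pseudo-caption from noisy multi-view captions. The paper performs this distillation with an LLM; purely for illustration, this sketch swaps in a medoid selection under token-overlap (Jaccard) similarity, picking the caption most similar on average to all others. All names (`jaccard`, `consensus_caption`, the example captions) are hypothetical and not from the paper.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two captions (illustrative stand-in
    for the semantic similarity an LLM-based consensus would use)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def consensus_caption(captions: list[str]) -> str:
    """Return the medoid caption: highest mean similarity to all other views."""
    best, best_score = captions[0], -1.0
    for i, c in enumerate(captions):
        score = sum(
            jaccard(c, captions[j]) for j in range(len(captions)) if j != i
        ) / max(len(captions) - 1, 1)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical captions of one object instance from three viewpoints:
views = [
    "a red mug on a wooden table",
    "a red coffee mug on the table",
    "a blurry red object near a chair",  # noisy, occluded viewpoint
]
print(consensus_caption(views))  # -> "a red mug on a wooden table"
```

The medoid choice illustrates the intuition behind consensus: captions that agree across views reinforce each other, while an outlier description from a poor viewpoint is voted out rather than propagated into the fine-tuning set.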