Improving Image Captioning by Mimicking Human Reformulation Feedback at Inference-time

📅 2025-01-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited quality and adaptability of image captioning models in non-English settings and style transfer scenarios, this paper proposes an inference-time enhancement method that requires no fine-tuning of the captioning model. It leverages human-provided caption reformulations as supervisory signals to train a lightweight rephrasing model that refines initial captions. This rephrasing model, trained exclusively on human revision data, is integrated into a two-stage inference framework. The approach achieves state-of-the-art performance on German image captioning and outperforms prior methods on English style transfer. Multidimensional human evaluation confirms significant improvements in grammatical correctness, visual fidelity, and linguistic fluency, corroborated by standard automatic metrics. The core innovation lies in using human caption reformulations as supervision to enable plug-and-play, cross-lingual and cross-style caption refinement without modifying the original captioning model.

📝 Abstract
Incorporating automatically predicted human feedback into the process of training generative models has attracted substantial recent interest, while feedback at inference time has received less attention. The typical feedback at training time, i.e., preferences of choice given two samples, does not naturally transfer to the inference phase. We introduce a novel type of feedback -- caption reformulations -- and train models to mimic reformulation feedback based on human annotations. Our method does not require training the image captioning model itself, thereby demanding substantially less computational effort. We experiment with two types of reformulation feedback: first, we collect a dataset of human reformulations that correct errors in the generated captions. We find that incorporating reformulation models trained on this data into the inference phase of existing image captioning models results in improved captions, especially when the original captions are of low quality. We apply our method to non-English image captioning, a domain where robust models are less prevalent, and gain substantial improvement. Second, we apply reformulations to style transfer. Quantitative evaluations reveal state-of-the-art performance on German image captioning and English style transfer, while human validation with a detailed comparative framework exposes the specific axes of improvement.
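The two-stage inference framework described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the captioning model and the reformulation model are stubbed with hypothetical placeholder functions, and the toy reformulation rule merely mimics the kind of grammatical correction a human revision would supply.

```python
# Hypothetical sketch of the paper's two-stage inference pipeline:
# stage 1 generates an initial caption with a frozen captioning model;
# stage 2 refines it with a reformulation model trained on human revisions.
# Both models are stand-in stubs here for illustration only.

def generate_caption(image):
    # Stand-in for any off-the-shelf captioning model (kept frozen;
    # the method never fine-tunes it).
    return "a dog play in the park"

def reformulate(caption):
    # Stand-in for the rephrasing model trained on human caption
    # reformulations. A toy rule fixes a grammatical error, mimicking
    # the error-correcting revisions in the collected dataset.
    return caption.replace("dog play", "dog playing")

def caption_with_reformulation(image):
    initial = generate_caption(image)  # stage 1: initial caption
    return reformulate(initial)        # stage 2: refined caption

print(caption_with_reformulation(None))
```

Because the reformulation step operates only on the generated text, the same rephraser can be plugged into any captioning model at inference time, which is what makes the approach cheap relative to retraining the captioner itself.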
Problem

Research questions and friction points this paper is trying to address.

Image Captioning
Multilingual Support
Style Transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feedback Mechanism
Image Captioning
Human-revised Captions