Improving Image Captioning by Mimicking Human Reformulation Feedback at Inference-time

📅 2025-01-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited quality and adaptability of image captioning models in non-English settings and style transfer scenarios, this paper proposes an inference-time enhancement method that requires no fine-tuning of the captioning model. It leverages human-provided caption reformulations as supervisory signals to train a lightweight rephrasing model that refines initial captions. This rephrasing model, trained exclusively on human revision data, is integrated into a two-stage inference framework. The approach achieves state-of-the-art performance on German image captioning and outperforms prior methods on English style transfer. Multidimensional human evaluation confirms significant improvements in grammatical correctness, visual fidelity, and linguistic fluency, corroborated by standard automatic metrics. The core innovation lies in using human caption reformulations as supervision to enable plug-and-play, cross-lingual and cross-style caption refinement without modifying the original captioning model.

📝 Abstract
Incorporating automatically predicted human feedback into the process of training generative models has attracted substantial recent interest, while feedback at inference time has received less attention. The typical feedback at training time, i.e., preferences of choice given two samples, does not naturally transfer to the inference phase. We introduce a novel type of feedback -- caption reformulations -- and train models to mimic reformulation feedback based on human annotations. Our method does not require training the image captioning model itself, thereby demanding substantially less computational effort. We experiment with two types of reformulation feedback: first, we collect a dataset of human reformulations that correct errors in the generated captions. We find that incorporating reformulation models trained on this data into the inference phase of existing image captioning models results in improved captions, especially when the original captions are of low quality. We apply our method to non-English image captioning, a domain where robust models are less prevalent, and gain substantial improvement. Second, we apply reformulations to style transfer. Quantitative evaluations reveal state-of-the-art performance on German image captioning and English style transfer, while human validation with a detailed comparative framework exposes the specific axes of improvement.
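The two-stage inference framework described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the captioning model and the reformulation model are stubbed with hypothetical placeholder functions, and the toy reformulation rule merely mimics the kind of grammatical correction a human revision would supply.

```python
# Hypothetical sketch of the paper's two-stage inference pipeline:
# stage 1 generates an initial caption with a frozen captioning model;
# stage 2 refines it with a reformulation model trained on human revisions.
# Both models are stand-in stubs here for illustration only.

def generate_caption(image):
    # Stand-in for any off-the-shelf captioning model (kept frozen;
    # the method never fine-tunes it).
    return "a dog play in the park"

def reformulate(caption):
    # Stand-in for the rephrasing model trained on human caption
    # reformulations. A toy rule fixes a grammatical error, mimicking
    # the error-correcting revisions in the collected dataset.
    return caption.replace("dog play", "dog playing")

def caption_with_reformulation(image):
    initial = generate_caption(image)  # stage 1: initial caption
    return reformulate(initial)        # stage 2: refined caption

print(caption_with_reformulation(None))
```

Because the reformulation step operates only on the generated text, the same rephraser can be plugged into any captioning model at inference time, which is what makes the approach cheap relative to retraining the captioner itself.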
Problem

Research questions and friction points this paper is trying to address.

Image Captioning
Multilingual Support
Style Transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feedback Mechanism
Image Captioning
Human-revised Captions