The Devil is in the EOS: Sequence Training for Detailed Image Captioning

📅 2025-07-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
In image captioning, vision-language models (VLMs) often truncate outputs prematurely because cross-entropy training induces a bias toward the end-of-sequence (EOS) token, yielding short, generic, detail-poor captions. To address this, the authors propose an unsupervised sequence-level training method that debiases the model's tendency to predict the EOS token too early, requiring no additional annotations, reward modeling, or architectural modifications. The approach is compatible with mainstream pre-trained VLMs. Evaluated on three VLMs across three standard benchmarks, it consistently increases caption length (+28.6%) and fine-grained content coverage, and human evaluation confirms richer descriptions. Although hallucination rates rise marginally (+1.3%), overall caption quality and faithfulness remain robust.

📝 Abstract
Despite significant advances in vision-language models (VLMs), image captioning often suffers from a lack of detail, with base models producing short, generic captions. This limitation persists even though VLMs are equipped with strong vision and language backbones. While supervised data and complex reward functions have been proposed to improve detailed image captioning, we identify a simpler underlying issue: a bias towards the end-of-sequence (EOS) token, which is introduced during cross-entropy training. We propose an unsupervised method to debias the model's tendency to predict the EOS token prematurely. By reducing this bias, we encourage the generation of longer, more detailed captions without the need for intricate reward functions or supervision. Our approach is straightforward, effective, and easily applicable to any pretrained model. We demonstrate its effectiveness through experiments with three VLMs and on three detailed captioning benchmarks. Our results show a substantial increase in caption length and relevant details, albeit with an expected increase in the rate of hallucinations.
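The paper does not include code here, but the core idea — an EOS bias learned during cross-entropy training makes the model stop too early, and reducing that bias yields longer captions — can be illustrated with a minimal sketch. The token id, vocabulary, and fixed logit penalty below are hypothetical, and the paper's actual method is a training-time debiasing rather than this toy logit adjustment:

```python
import numpy as np

# Hypothetical illustration, not the paper's implementation:
# lower the EOS logit so the EOS probability drops, making
# early termination less likely during generation.

EOS_ID = 2  # assumed end-of-sequence token id in a toy vocabulary


def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


def debias_eos(logits, eos_id=EOS_ID, penalty=2.0):
    """Subtract a fixed penalty from the EOS logit (illustrative only)."""
    out = logits.copy()
    out[eos_id] -= penalty
    return out


# Toy 5-token vocabulary where EOS starts out as the most likely token.
logits = np.array([1.0, 0.5, 3.0, 0.2, 0.1])
p_before = softmax(logits)[EOS_ID]
p_after = softmax(debias_eos(logits))[EOS_ID]
assert p_after < p_before  # EOS becomes less likely after debiasing
```

In practice the paper achieves this effect through its unsupervised sequence-level training objective rather than a hand-set decoding penalty; the sketch only conveys why suppressing the EOS bias lengthens generated captions.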
Problem

Research questions and friction points this paper is trying to address.

Addresses lack of detail in image captioning models
Identifies bias towards premature EOS token prediction
Proposes unsupervised debiasing for longer captions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised debiasing of EOS token prediction
Encourages longer detailed captions without supervision
Applicable to any pretrained vision-language model
Abdelrahman Mohamed
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Yova Kementchedjhieva
Assistant Professor, MBZUAI
Natural Language Processing