CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning

📅 2025-07-02

📈 Citations: 0

✨ Influential: 0

career value

177K/year

🤖 AI Summary

Existing image captioning models struggle to flexibly and continuously control linguistic attributes—such as caption length, descriptiveness, and lexical uniqueness—primarily due to the absence of explicit attribute-conditioned modeling during training and a lack of smooth interpolation mechanisms. This paper proposes the first fully self-supervised, multi-dimensional controllable captioning framework that requires no human annotations. First, discrete linguistic attributes are quantified into continuous scalar representations via self-supervised learning. Second, an endpoint vector interpolation mechanism is introduced to enable fine-grained, differentiable control over linguistic styles in the latent space. The method preserves high word–image alignment accuracy while reducing caption length control error by 506% compared to baselines. To our knowledge, this is the first approach achieving disentangled, continuous, and entirely self-supervised control over multiple caption attributes.

Technology Category

Application Category

📝 Abstract

An image captioning model flexibly switching its language pattern, e.g., descriptiveness and length, should be useful since it can be applied to diverse applications. However, despite the dramatic improvement in generative vision-language models, fine-grained control over the properties of generated captions is not easy due to two reasons: (i) existing models are not given the properties as a condition during training and (ii) existing models cannot smoothly transition its language pattern from one state to the other. Given this challenge, we propose a new approach, CaptionSmiths, to acquire a single captioning model that can handle diverse language patterns. First, our approach quantifies three properties of each caption, length, descriptiveness, and uniqueness of a word, as continuous scalar values, without human annotation. Given the values, we represent the conditioning via interpolation between two endpoint vectors corresponding to the extreme states, e.g., one for a very short caption and one for a very long caption. Empirical results demonstrate that the resulting model can smoothly change the properties of the output captions and show higher lexical alignment than baselines. For instance, CaptionSmiths reduces the error in controlling caption length by 506% despite better lexical alignment. Code will be available on https://github.com/omron-sinicx/captionsmiths.

Problem

Research questions and friction points this paper is trying to address.

Flexibly control language patterns in image captioning

Enable smooth transition between caption properties

Improve caption length and lexical alignment accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantifies caption properties without human annotation

Uses interpolation between endpoint vectors for conditioning

Smoothly controls output caption properties

🔎 Similar Papers

Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends and Metrics Analysis