CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning

📅 2025-07-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing image captioning models struggle to flexibly and continuously control linguistic attributes—such as caption length, descriptiveness, and lexical uniqueness—primarily due to the absence of explicit attribute-conditioned modeling during training and a lack of smooth interpolation mechanisms. This paper proposes the first fully self-supervised, multi-dimensional controllable captioning framework that requires no human annotations. First, discrete linguistic attributes are quantified into continuous scalar representations via self-supervised learning. Second, an endpoint vector interpolation mechanism is introduced to enable fine-grained, differentiable control over linguistic styles in the latent space. The method preserves high word–image alignment accuracy while reducing caption length control error by 506% compared to baselines. To our knowledge, this is the first approach achieving disentangled, continuous, and entirely self-supervised control over multiple caption attributes.

Technology Category

Application Category

📝 Abstract
An image captioning model flexibly switching its language pattern, e.g., descriptiveness and length, should be useful since it can be applied to diverse applications. However, despite the dramatic improvement in generative vision-language models, fine-grained control over the properties of generated captions is not easy due to two reasons: (i) existing models are not given the properties as a condition during training and (ii) existing models cannot smoothly transition its language pattern from one state to the other. Given this challenge, we propose a new approach, CaptionSmiths, to acquire a single captioning model that can handle diverse language patterns. First, our approach quantifies three properties of each caption, length, descriptiveness, and uniqueness of a word, as continuous scalar values, without human annotation. Given the values, we represent the conditioning via interpolation between two endpoint vectors corresponding to the extreme states, e.g., one for a very short caption and one for a very long caption. Empirical results demonstrate that the resulting model can smoothly change the properties of the output captions and show higher lexical alignment than baselines. For instance, CaptionSmiths reduces the error in controlling caption length by 506% despite better lexical alignment. Code will be available on https://github.com/omron-sinicx/captionsmiths.
Problem

Research questions and friction points this paper is trying to address.

Flexibly control language patterns in image captioning
Enable smooth transition between caption properties
Improve caption length and lexical alignment accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantifies caption properties without human annotation
Uses interpolation between endpoint vectors for conditioning
Smoothly controls output caption properties
🔎 Similar Papers
No similar papers found.