Linear Alignment of Vision-language Models for Image Captioning

📅 2023-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the misalignment of image and text modalities in the joint embedding space of vision-language models such as CLIP, a known limiter of image-captioning performance, this paper proposes ReCap. First, it introduces a closed-form linear re-alignment method that efficiently calibrates image and text representations directly in the CLIP space. Second, it designs a lightweight conditional language model that generates captions conditioned on the re-aligned CLIP features. Third, it proposes two learnable evaluation metrics built on the aligned CLIP similarity score that correlate more strongly with human judgement than existing metrics. Experiments show that ReCap can be trained up to 1,000× faster than existing lightweight approaches, performs on par with the state of the art on MS-COCO and Flickr30k, achieves new best results on VizWiz and MSRVTT, and demonstrates strong generalization and robustness to input noise.
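The core of the method is the closed-form linear mapping. Below is a minimal sketch of one way such a re-alignment can be fit; the paper only states that a closed-form solution for a linear mapping in the joint CLIP space is computed, so the specific objective here (ordinary least squares from image to text embeddings) and all function names are illustrative assumptions.

```python
# Hedged sketch: fit a linear map W in closed form so that image embeddings
# land near their paired caption embeddings in CLIP space. The exact
# objective used by ReCap is not reproduced here; least squares is an
# assumption for illustration.
import numpy as np

def fit_linear_alignment(image_embs: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Fit W minimizing ||image_embs @ W - text_embs||_F^2 in closed form.

    image_embs: (n, d) CLIP image embeddings of paired training examples
    text_embs:  (n, d) CLIP text embeddings of the matching captions
    Returns W:  (d, d) linear map from the image region to the text region.
    """
    # np.linalg.lstsq solves the least-squares system in one shot,
    # i.e. the closed-form normal-equation solution computed stably via SVD.
    W, *_ = np.linalg.lstsq(image_embs, text_embs, rcond=None)
    return W

def realign(image_emb: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Map a single image embedding into the text region of the joint space."""
    z = image_emb @ W
    return z / np.linalg.norm(z)  # re-normalize onto the unit sphere
```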
📝 Abstract
Recently, vision-language models like CLIP have advanced the state of the art in a variety of multi-modal tasks including image captioning and caption evaluation. Many approaches leverage CLIP for cross-modal retrieval to condition pre-trained language models on visual input. However, CLIP generally suffers from a misalignment of image and text modalities in the joint embedding space. We investigate efficient methods to linearly re-align the joint embedding space for the downstream task of image captioning. This leads to an efficient training protocol that merely requires computing a closed-form solution for a linear mapping in the joint CLIP space. Consequently, we propose a lightweight captioning method called ReCap, which can be trained up to 1,000 times faster than existing lightweight methods. Moreover, we propose two new learning-based image-captioning metrics built on CLIP score along with our proposed alignment. We evaluate ReCap on MS-COCO, Flickr30k, VizWiz, and MSRVTT. On the former two, ReCap performs comparably to state-of-the-art lightweight methods under rule-based metrics while outperforming them on most CLIP-based metrics. On the latter two benchmarks, ReCap consistently outperforms competitors across all metrics and exhibits strong transfer capabilities and resilience to noise. Finally, we demonstrate that our proposed metrics correlate more strongly with human judgement than existing metrics on the Flickr8k-Expert, Flickr8k-CrowdFlower, and THumB datasets.
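The abstract describes conditioning a pre-trained language model on visual input via cross-modal retrieval. A minimal sketch of that flow is below: re-align the image embedding, retrieve the nearest stored captions in the joint space, and hand them to the language model as context. Function names, the prompt format, and k=5 are assumptions, not the paper's exact recipe; `fit_linear_alignment` refers to the sketch above.

```python
# Hedged sketch of retrieval-conditioned captioning in the re-aligned space.
import numpy as np

def retrieve_captions(image_emb, W, caption_embs, captions, k=5):
    """Return the k stored captions closest to the re-aligned image embedding.

    caption_embs is assumed to hold row-normalized CLIP text embeddings of a
    caption datastore, so a dot product equals cosine similarity.
    """
    z = image_emb @ W
    z = z / np.linalg.norm(z)
    sims = caption_embs @ z              # cosine similarities to every caption
    top_k = np.argsort(-sims)[:k]        # indices of the k most similar
    return [captions[i] for i in top_k]

def build_prompt(retrieved):
    """Assemble retrieved captions into a conditioning prompt (assumed format)."""
    context = " ".join(retrieved)
    return f"Similar captions: {context} A caption for this image is:"
```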
Problem

Research questions and friction points this paper is trying to address.

Image and text modalities are misaligned in the joint CLIP embedding space, which limits captioning performance
Lightweight CLIP-based captioning methods remain costly to train
Existing caption-evaluation metrics correlate imperfectly with human judgement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear re-alignment in CLIP space
Lightweight ReCap method
New learnable CLIP-based captioning metrics (aligned-score core sketched below)
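The proposed metrics are learning-based and built on CLIP score combined with the alignment. The sketch below shows only the aligned-cosine core that such a metric could rest on; any learned component on top is omitted, and the function name and signature are assumptions.

```python
# Hedged sketch of an aligned CLIP-score backbone: cosine similarity between
# the re-aligned image embedding and a candidate caption embedding.
import numpy as np

def aligned_clip_score(image_emb: np.ndarray, caption_emb: np.ndarray,
                       W: np.ndarray) -> float:
    """Similarity of a caption to the image after linear re-alignment."""
    z = image_emb @ W
    z = z / np.linalg.norm(z)                      # re-aligned image, unit norm
    c = caption_emb / np.linalg.norm(caption_emb)  # caption, unit norm
    return float(z @ c)
```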
Fabian Paischer
Senior Scientist, Institute for Machine Learning, ELLIS Unit / University Linz, EmmiAI
AI4Science · Deep Learning · Nuclear Fusion · Deep Reinforcement Learning · Natural Language Processing
M. Hofmarcher
ELLIS Unit Linz, LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria; JKU LIT SAL eSPML Lab, Institute for Machine Learning, Johannes Kepler University, Linz, Austria
Sepp Hochreiter
Institute for Machine Learning, Johannes Kepler University Linz
Machine Learning · Deep Learning · Artificial Intelligence · Neural Networks · Bioinformatics
Thomas Adler
ELLIS Unit Linz, LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria