🤖 AI Summary
Existing image and video caption evaluation metrics (e.g., BLEU, CIDEr) rely on human-written references or noisy pre-training data, limiting their ability to assess semantic quality and fine-grained faithfulness. To address this, we propose PAC-S++, a learnable vision-language metric built on CLIP. PAC-S++ constructs high-quality contrastive learning signals from generated image-caption positive pairs and, uniquely, employs the learned metric directly as the reward in Self-Critical Sequence Training (SCST) when fine-tuning captioning models. The method combines positive-sample-augmented contrastive learning with regularization that is robust to multimodal noise. Experiments show that PAC-S++ significantly outperforms conventional metrics across major captioning benchmarks and is more sensitive to object hallucination. Used as an SCST reward, it reduces repetition by 18%, cuts grammatical errors by 23%, and consistently improves semantic richness, faithfulness, and cross-domain generalization.
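For intuition, a CLIP-style reference-free metric scores a caption by the cosine similarity between the image and caption embeddings of a (possibly fine-tuned) CLIP backbone. The sketch below is a minimal illustration using the off-the-shelf Hugging Face CLIP API, not the released PAC-S++ implementation; the checkpoint name and the 2.5 scaling factor (the convention from CLIP-Score) are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical backbone choice; PAC-S++ instead uses a CLIP model further
# trained on cleaned data with generated visual/textual positive pairs.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_style_score(image: Image.Image, caption: str, scale: float = 2.5) -> float:
    """Reference-free caption score: scaled, clipped cosine similarity
    between image and caption embeddings (CLIP-Score convention)."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    # image_embeds and text_embeds are already L2-normalized by the model,
    # so their dot product is the cosine similarity.
    cos = (out.image_embeds * out.text_embeds).sum(dim=-1).item()
    return scale * max(cos, 0.0)
```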
📝 Abstract
Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for caption evaluation but also for the generation phase: metrics can play a key role in the fine-tuning stage of captioning models, ultimately enhancing the quality of the generated captions. In this paper, we propose PAC-S++, a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data and regularized through additional pairs of generated visual and textual positive samples. Exploiting this stronger and curated pre-training, we also apply PAC-S++ as a reward in the Self-Critical Sequence Training (SCST) stage typically employed to fine-tune captioning models. Extensive experiments on different image and video datasets highlight the effectiveness of PAC-S++ compared to popular metrics for the task, including its sensitivity to object hallucinations. Furthermore, we show that integrating PAC-S++ into the fine-tuning stage of a captioning model results in semantically richer captions with fewer repetitions and grammatical errors. Evaluations on out-of-domain benchmarks further demonstrate the efficacy of our fine-tuning approach in enhancing model capabilities. Source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.
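For readers unfamiliar with SCST, the sketch below shows how a learnable metric can replace a reference-based reward such as CIDEr: the reward of a sampled caption is baselined against the greedy-decoded caption, and the resulting advantage weights the REINFORCE gradient. The `model.generate`, `model.sample`, `metric`, and `tokenizer` interfaces here are hypothetical placeholders, not the repository's actual API.

```python
import torch

def scst_step(model, images, metric, tokenizer):
    """One SCST update with a learnable metric (e.g., PAC-S++) as reward.
    `model`, `metric`, and `tokenizer` are assumed, hypothetical interfaces."""
    # Greedy decoding provides the self-critical baseline; no gradients needed.
    with torch.no_grad():
        greedy_ids = model.generate(images, do_sample=False)
        baseline = metric(images, tokenizer.batch_decode(greedy_ids))
    # Multinomial sampling provides the captions being optimized, together
    # with their per-token log-probabilities (assumed returned by sample()).
    sampled_ids, log_probs = model.sample(images)
    reward = metric(images, tokenizer.batch_decode(sampled_ids))
    # REINFORCE with the greedy baseline: advantage = reward - baseline.
    advantage = (reward - baseline).detach()
    loss = -(advantage * log_probs.sum(dim=-1)).mean()
    loss.backward()
    return loss.item()
```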