SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Synthetic image-caption pairs used for zero-shot image captioning (ZIC) often suffer from semantic misalignment (e.g., missing objects or incorrect attributes), introducing substantial data noise. Departing from conventional filtering or regeneration strategies, this paper proposes SynC, a reassignment-based dataset refinement framework. Its core contribution is a one-to-many image-caption mapping coupled with a cycle-consistency scoring mechanism: text-to-image retrieval proposes multiple candidate images for each caption, and image-to-text reverse retrieval verifies alignment, forming a closed-loop caption reassignment pipeline. Evaluated on MS-COCO, Flickr30k, and NoCaps, the method consistently improves diverse ZIC models on BLEU-4, CIDEr, and other standard metrics, achieving state-of-the-art results in several scenarios and demonstrating the effectiveness and generalizability of semantic alignment refinement for synthetic ZIC datasets.

📝 Abstract
Zero-shot Image Captioning (ZIC) increasingly utilizes synthetic datasets generated by text-to-image (T2I) models to mitigate the need for costly manual annotation. However, these T2I models often produce images that exhibit semantic misalignments with their corresponding input captions (e.g., missing objects, incorrect attributes), resulting in noisy synthetic image-caption pairs that can hinder model training. Existing dataset pruning techniques are largely designed for removing noisy text in web-crawled data. However, these methods are ill-suited for the distinct challenges of synthetic data, where captions are typically well-formed, but images may be inaccurate representations. To address this gap, we introduce SynC, a novel framework specifically designed to refine synthetic image-caption datasets for ZIC. Instead of conventional filtering or regeneration, SynC focuses on reassigning captions to the most semantically aligned images already present within the synthetic image pool. Our approach employs a one-to-many mapping strategy by initially retrieving multiple relevant candidate images for each caption. We then apply a cycle-consistency-inspired alignment scorer that selects the best image by verifying its ability to retrieve the original caption via image-to-text retrieval. Extensive evaluations demonstrate that SynC consistently and significantly improves performance across various ZIC models on standard benchmarks (MS-COCO, Flickr30k, NoCaps), achieving state-of-the-art results in several scenarios. SynC offers an effective strategy for curating refined synthetic data to enhance ZIC.
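The reassignment pipeline described above (text-to-image retrieval of top-k candidate images per caption, then a cycle-consistency check via image-to-text retrieval) can be sketched roughly as follows. This is an illustrative sketch, not the paper's exact formulation: the embeddings are assumed to come from a CLIP-like dual encoder, `refine_assignments`, `k`, and the rank-penalty weighting are all hypothetical names and choices.

```python
import numpy as np

def refine_assignments(cap_emb, img_emb, k=5):
    """Reassign each caption to the synthetic image that best 'cycles back'
    to it. cap_emb and img_emb are assumed to be L2-normalized embedding
    matrices (captions x dim, images x dim), e.g. from a CLIP-like encoder."""
    sim = cap_emb @ img_emb.T  # caption-to-image cosine similarities
    assignments = []
    for i, row in enumerate(sim):
        # One-to-many mapping: retrieve top-k candidate images for caption i.
        cands = np.argsort(row)[::-1][:k]
        best, best_score = cands[0], -np.inf
        for j in cands:
            # Cycle-consistency check: rank all captions against image j and
            # favor images for which caption i is retrieved near the top.
            i2t = cap_emb @ img_emb[j]
            rank = int((i2t > i2t[i]).sum())   # 0 means caption i ranks first
            score = row[j] - 0.1 * rank        # alignment minus a rank penalty
                                               # (the weighting is an assumption)
            if score > best_score:
                best, best_score = j, score
        assignments.append(int(best))
    return assignments
```

With identity-like embeddings (each caption perfectly matching one image), the sketch assigns each caption back to its matching image; on noisy synthetic pools, the cycle check is what steers captions away from images that superficially match but fail reverse retrieval.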
Problem

Research questions and friction points this paper is trying to address.

T2I-generated images often misalign semantically with their input captions (missing objects, incorrect attributes), yielding noisy synthetic image-caption pairs
Noisy synthetic pairs hinder the training of zero-shot image captioning models
Existing dataset pruning methods target noisy text in web-crawled data and are ill-suited to synthetic data, where captions are well-formed but images may be inaccurate
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reassigns captions to the most semantically aligned images already in the synthetic pool, rather than filtering or regenerating
Uses a one-to-many mapping that retrieves multiple candidate images per caption
Selects the best candidate with a cycle-consistency-inspired alignment scorer that verifies each image can retrieve its original caption via image-to-text retrieval