🤖 AI Summary
Existing vision-language models (e.g., CLIP) rely on web-scraped annotations that are literal and weakly associative, limiting their capacity for creative understanding.
Method: We propose a scalable, annotation-free framework that jointly performs image saliency detection and contextual association mining to automatically extract situation-aware visual element relations from unlabeled images.
Contribution/Results: This yields the first large-scale creative vision-language dataset (1.7M caption-image pairs) supporting progressive abstraction, from concrete to metaphorical and poetic descriptions. Fine-tuning CLIP and related models on this dataset enables multi-level creative captioning. Human evaluation confirms high visual relevance and abstract plausibility, and substantial improvements are observed in zero-shot metaphor visualization and poetic cross-modal retrieval, empirically validating the efficacy of explicit visual associative modeling for creative understanding and generation.
📝 Abstract
Understanding another person's creative output requires a shared language of association. However, when training vision-language models such as CLIP, we rely on web-scraped datasets containing short, predominantly literal alt-text. In this work, we introduce a method for mining contextualized associations for salient visual elements in an image that can scale to any unlabeled dataset. Given an image, we can use these mined associations to generate high-quality creative captions at increasing degrees of abstraction. With our method, we produce a new dataset of visual associations and 1.7M creative captions for the images in MSCOCO. Human evaluation confirms that these captions remain visually grounded while exhibiting recognizably increasing abstraction. Moreover, fine-tuning a visual encoder on this dataset yields meaningful improvements in zero-shot image-text retrieval in two creative domains: poetry and metaphor visualization. We release our dataset, our generation code, and our models for use by the broader community.
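The zero-shot retrieval evaluation mentioned above reduces to ranking candidate captions by cosine similarity between CLIP-style image and text embeddings. Below is a minimal sketch of that scoring step using toy NumPy vectors in place of real encoder outputs; the `retrieve` helper and the example embeddings are illustrative assumptions, not the paper's actual evaluation code.

```python
import numpy as np

def retrieve(image_emb, caption_embs):
    """Rank captions for one image by cosine similarity (CLIP-style scoring).

    image_emb:    (d,) image embedding
    caption_embs: (n, d) caption embeddings
    Returns caption indices sorted best-first.
    """
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = caps @ img            # cosine similarity of each caption to the image
    return np.argsort(-sims)     # descending similarity

# Toy embeddings standing in for encoder outputs: caption 1 aligns most
# closely with the image direction, caption 0 is orthogonal to it.
image_emb = np.array([1.0, 0.0, 0.0, 0.0])
caption_embs = np.array([
    [0.0, 1.0, 0.0, 0.0],   # unrelated caption
    [0.9, 0.1, 0.0, 0.0],   # closely aligned caption
    [0.5, 0.5, 0.0, 0.0],   # partially aligned caption
])
print(retrieve(image_emb, caption_embs))  # -> [1 2 0]
```

In a real evaluation, `image_emb` and `caption_embs` would come from the fine-tuned visual encoder and the text encoder, and retrieval metrics (e.g., recall@k) would be computed over the ranked lists.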