The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual metaphor generation aims to synthesize semantically faithful and visually coherent images from textual metaphors (e.g., "time is a river"), posing core challenges in binding the source and target concepts and in cross-modal alignment. This paper proposes a dual-path generative framework: the first path enables training-free, lightweight generation via prompt decomposition into source-target-meaning (S-T-M) mappings, CLIP-based semantic alignment, and a self-assessment reward mechanism; the second path employs lightweight reinforcement learning to optimize metaphorical structure. The authors also introduce a metaphor decomposition score and a Meaning Alignment (MA) metric, presented as the first unsupervised evaluation metric for metaphor quality. Experiments demonstrate that the method outperforms GPT-4o and Imagen in decomposition accuracy, CLIP similarity, and MA score, at significantly lower computational cost. A user study confirms its strength on abstract metaphors.
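The self-assessment reward described above combines several metric signals (decomposition quality, CLIP similarity, MA score) into a single scalar used to rank or optimize candidate images. A minimal sketch of such a combination follows; the weights and score names are illustrative assumptions, not the paper's actual values or API.

```python
from dataclasses import dataclass


@dataclass
class MetaphorScores:
    """Component scores for one generated image, each assumed to lie in [0, 1]."""
    decomposition: float      # how well the S-T-M parts were recovered
    clip_similarity: float    # image-text alignment (e.g., CLIP cosine score)
    meaning_alignment: float  # the MA metric


def self_evaluation_reward(scores: MetaphorScores,
                           weights=(0.3, 0.3, 0.4)) -> float:
    """Weighted combination of the three metrics into one scalar reward.

    The weights here are hypothetical placeholders for illustration.
    """
    w_d, w_c, w_m = weights
    return (w_d * scores.decomposition
            + w_c * scores.clip_similarity
            + w_m * scores.meaning_alignment)


# Example: select the best of several candidate images by reward.
candidates = [
    MetaphorScores(0.8, 0.6, 0.7),
    MetaphorScores(0.5, 0.9, 0.6),
]
best = max(candidates, key=self_evaluation_reward)
```

A scalar reward of this shape is what makes both paths possible: the training-free path can use it to rerank samples, while the RL path can use it directly as an optimization signal.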

📝 Abstract
Visual metaphor generation is a challenging task that aims to generate an image from an input text metaphor. Inherently, it requires language understanding to bind a source concept with a target concept in a way that preserves meaning while ensuring visual coherence. We propose a self-evaluating visual metaphor generation framework that focuses on metaphor alignment. Our self-evaluation approach combines existing metrics with our newly proposed metaphor decomposition score and a meaning alignment (MA) metric. Within this setup, we explore two novel approaches: a training-free pipeline that explicitly decomposes prompts into source-target-meaning (S-T-M) mappings for image synthesis, and a complementary training-based pipeline that improves alignment using our proposed self-evaluation reward schema, without any large-scale retraining. On the held-out test set, the training-free approach surpasses strong closed baselines (GPT-4o, Imagen) on decomposition, CLIP, and MA scores, with the training-based approach close behind. We evaluate our framework's output in a user study and observe that participants preferred GPT-4o overall, while our training-free pipeline led open-source methods and edged out Imagen on abstract metaphors. Our analyses show that S-T-M prompting helps with longer or more abstract metaphors, while closed models excel on short, concrete cases; we also observe sensitivity to sampler settings. Overall, structured prompting and lightweight RL achieve strong metaphor alignment under modest compute, and the remaining gap to human preference appears driven by aesthetics and sampling.
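The CLIP score used throughout the evaluation is, at its core, a cosine similarity between a text embedding and an image embedding. The sketch below shows only that similarity computation on placeholder vectors; in practice the embeddings would come from a CLIP encoder, which is omitted here to keep the example self-contained.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Placeholder vectors standing in for CLIP text/image features.
text_emb = np.array([0.2, 0.9, 0.1])
image_emb = np.array([0.25, 0.85, 0.05])
score = cosine_similarity(text_emb, image_emb)
```

A higher score indicates closer text-image alignment; the framework's MA and decomposition metrics layer metaphor-specific structure on top of this generic alignment signal.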
Problem

Research questions and friction points this paper is trying to address.

Generating images from text metaphors with visual coherence
Evaluating metaphor alignment using decomposition and meaning metrics
Improving metaphor generation without large-scale retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-evaluating framework with metaphor decomposition
Training-free S-T-M mapping pipeline for synthesis
Lightweight RL reward schema without retraining
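The S-T-M mapping listed above turns a raw metaphor into explicit source, target, and meaning slots before image synthesis. A minimal sketch of assembling a synthesis prompt from such a mapping is below; the template wording is a hypothetical illustration, not the paper's actual prompt format.

```python
def stm_prompt(source: str, target: str, meaning: str) -> str:
    """Build a structured image-synthesis prompt from an S-T-M mapping.

    Hypothetical template: the paper's exact prompt wording is not specified here.
    """
    return (f"A visual metaphor where {target} is depicted as {source}, "
            f"conveying that {meaning}.")


# "Time is a river": source = a river, target = time.
prompt = stm_prompt(
    source="a river",
    target="time",
    meaning="time flows continuously and cannot be held back",
)
```

Making the three slots explicit is what lets the pipeline check each one separately (the decomposition score) instead of treating the metaphor as an opaque caption.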