Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention

📅 2025-06-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing cross-modal image generation methods either lose detail when encoding a real image or couple the textual and visual concepts too tightly, which limits controllability. To address these issues, we propose IT-Blender, an adapter for text-to-image diffusion models built around a blended attention mechanism that encodes a real reference image without loss of detail and fuses its visual concept with the text-specified object in a disentangled way. Built upon pretrained diffusion models (Stable Diffusion and FLUX), IT-Blender blends the latent representations of a clean reference image with those of the noisy generated image. Experiments show that the method outperforms the baselines by a large margin in blending visual and textual concepts, pointing to creative design as a new application of image generative models.

📝 Abstract
Blending visual and textual concepts into a new visual concept is a unique and powerful trait of human beings that can fuel creativity. However, in practice, cross-modal conceptual blending for humans is prone to cognitive biases, like design fixation, which lead to local minima in the design space. In this paper, we propose a T2I diffusion adapter, "IT-Blender", that can automate the blending process to enhance human creativity. Prior works related to cross-modal conceptual blending are limited either in encoding a real image without loss of details or in disentangling the image and text inputs. To address these gaps, IT-Blender leverages pretrained diffusion models (SD and FLUX) to blend the latent representations of a clean reference image with those of the noisy generated image. Combined with our novel blended attention, IT-Blender encodes the real reference image without loss of details and blends the visual concept with the object specified by the text in a disentangled way. Our experimental results show that IT-Blender outperforms the baselines by a large margin in blending visual and textual concepts, shedding light on a new application of image generative models to augment human creativity.

Problem

Research questions and friction points this paper is trying to address.

Automate blending visual and textual concepts to enhance creativity
Address loss of details in encoding real images for blending
Disentangle image and text inputs for effective conceptual blending

Innovation

Methods, ideas, or system contributions that make the work stand out.

Blends real images and text via blended attention
Uses pretrained diffusion models for latent blending
Encodes reference images without detail loss
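
The summary does not spell out how blended attention is computed. One plausible reading, offered here only as a minimal sketch, is that queries from the noisy generated-image stream attend over keys and values drawn from both that stream and the clean reference-image stream. Everything below is an assumption for illustration: the function name `blended_attention`, the single shared softmax over the concatenated keys, and the use of raw tokens in place of learned query/key/value projections are hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def blended_attention(q_gen, k_gen, v_gen, k_ref, v_ref):
    """Hypothetical blended attention (an assumption, not the paper's code).

    Queries come from the noisy generated stream; keys and values are the
    concatenation of the generated stream and the clean reference stream.
    Returns the attended outputs and, per query, the share of attention
    mass that landed on reference tokens.
    """
    keys = k_gen + k_ref
    values = v_gen + v_ref
    scale = 1.0 / math.sqrt(len(q_gen[0]))
    width = len(values[0])
    outputs, ref_mass = [], []
    for q in q_gen:
        weights = softmax([scale * dot(q, k) for k in keys])
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
        ref_mass.append(sum(weights[len(k_gen):]))
    return outputs, ref_mass


def random_tokens(rng, count, dim):
    """Random stand-ins for latent tokens (illustrative data only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
dim = 8
gen_tokens = random_tokens(rng, 4, dim)  # stand-in for noisy generated-image tokens
ref_tokens = random_tokens(rng, 6, dim)  # stand-in for clean reference-image tokens

# For brevity the tokens act as their own queries, keys, and values;
# a real model would apply learned projections first.
blended, ref_mass = blended_attention(
    gen_tokens, gen_tokens, gen_tokens, ref_tokens, ref_tokens
)
```

In this sketch, passing empty reference lists reduces the computation to ordinary self-attention over the generated tokens, which makes it easy to compare outputs with and without the reference image.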