🤖 AI Summary
Existing text-to-image generation models overly prioritize literal semantic alignment, often yielding outputs that lack visual novelty and artistic creativity. To address this limitation, this work proposes Self-Creative Diffusion (SCDiff), a framework incorporating a Learnable Spatial Weighting (LSW) module that strengthens feature representation in central image regions through parameterized Kaiser-Bessel windows. It also introduces a Visual-Semantic Mixing Loss (VSML) that jointly preserves semantic fidelity and encourages diverse, imaginative generations. By moving beyond the conventional diffusion paradigm, which confines synthesis to high-probability regions of the data distribution, SCDiff is reported to improve the creativity, semantic accuracy, and visual coherence of generated images, producing outputs with greater artistic value and perceptual surprise in text-to-object generation tasks.
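As a rough illustration of the center-weighting idea, the following minimal sketch builds a separable 2D Kaiser-Bessel window with NumPy and multiplies a feature map by it. It is not the authors' implementation: the function names, the separable outer-product construction, and the fixed `beta` value are assumptions made here for illustration. In SCDiff the window parameter is described as learnable, whereas this sketch treats it as a plain float.

```python
import numpy as np


def kaiser_bessel_weight_map(height, width, beta=4.0):
    """Separable 2D Kaiser-Bessel window that peaks at the image center.

    `beta` controls how sharply the weights fall off toward the borders;
    beta = 0 gives a flat (all-ones) map. Here it is a fixed float, which
    is an assumption of this sketch, not the paper's parameterization.
    """
    wy = np.kaiser(height, beta)  # 1D window along rows
    wx = np.kaiser(width, beta)   # 1D window along columns
    return np.outer(wy, wx)       # shape (height, width)


def apply_spatial_weighting(features, beta=4.0):
    """Scale a (channels, height, width) feature map by the window."""
    _, h, w = features.shape
    weights = kaiser_bessel_weight_map(h, w, beta)
    return features * weights[None, :, :]  # broadcast over channels


if __name__ == "__main__":
    feats = np.ones((3, 5, 5))
    weighted = apply_spatial_weighting(feats, beta=4.0)
    # Center features keep full weight; border features are attenuated.
    print(weighted[0].round(3))
```

Larger `beta` values concentrate the weight more tightly around the center, which is the knob a learnable version would presumably adjust during training.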
📝 Abstract
Instilling creativity in text-to-image (T2I) generation presents a significant challenge, as it requires synthesized images to exhibit not only visual novelty and surprise, but also artistic value. Current T2I models, however, are largely optimized for literal text-image alignment with their data distribution, and their noise prediction networks constrain generation to high-probability regions, so their outputs lack authentic creativity. To address this, we propose a Self-Creative Diffusion (SCDiff) model for meaningful T2I generation, featuring two core modules: a learnable spatial weighting (LSW) module and a visual-semantic mixing loss (VSML). The LSW module uses a parametric Kaiser-Bessel window to reinforce central image features, fostering novel and surprising generation. The VSML module introduces a dual loss function: a similarity loss constrains the new image to align with its textual description, while a diversity loss maximizes its distinction from the original image, enhancing both semantic value and visual novelty. Extensive experiments demonstrate that our model substantially improves creativity, semantic alignment, and visual coherence, offering a simple yet powerful framework for generating creative objects.
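The dual loss described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formulation: it assumes the new image, the text, and the original image have already been mapped to embedding vectors in a shared space by some encoder, it uses cosine similarity for both terms, and the `diversity_weight` parameter and all function names are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def mixed_loss(new_image_emb, text_emb, original_image_emb, diversity_weight=0.5):
    """Hypothetical visual-semantic mixing loss.

    similarity term: small when the new image matches the text.
    diversity term:  small when the new image differs from the original.
    The weighting and the cosine form are assumptions of this sketch.
    """
    similarity_loss = 1.0 - cosine_similarity(new_image_emb, text_emb)
    diversity_loss = cosine_similarity(new_image_emb, original_image_emb)
    return similarity_loss + diversity_weight * diversity_loss


if __name__ == "__main__":
    text = [1.0, 0.0]
    original = [0.0, 1.0]
    # Aligned with the text and unlike the original: low loss.
    print(mixed_loss([1.0, 0.0], text, original))
    # Copy of the original and unrelated to the text: high loss.
    print(mixed_loss([0.0, 1.0], text, original))
```

Minimizing the sum pulls the generated image toward the prompt while pushing it away from the original image, which is the trade-off between semantic fidelity and visual novelty that the abstract describes.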