🤖 AI Summary
Cross-concept fusion in generative models often suffers from semantic incompatibility and mismatches in shape and appearance; existing approaches rely on fine-tuning or architectural constraints, which limits generalizability and fidelity. This paper proposes FreeBlend, a training-free, staged, feedback-driven fusion framework. First, transferred source image embeddings are used as conditional inputs to achieve semantic alignment. Second, staged interpolation is performed in the latent space with a stepwise increasing blending ratio, augmented by a reverse-order feedback mechanism that updates the auxiliary latents to preserve global coherence and local detail fidelity. Finally, conditional guidance keeps the integration natural and harmonious. The method requires no model modification or retraining. In the authors' experiments, it significantly improves the semantic coherence and visual quality of blended images, offering a training-free approach to compositional concept synthesis.
📝 Abstract
Concept blending is a promising yet underexplored area in generative models. While recent approaches, such as embedding mixing and latent modification based on structural sketches, have been proposed, they often suffer from incompatible semantic information and discrepancies in shape and appearance. In this work, we introduce FreeBlend, an effective, training-free framework designed to address these challenges. To mitigate cross-modal loss and enhance feature detail, we leverage transferred image embeddings as conditional inputs. The framework employs a stepwise increasing interpolation strategy between latents, progressively adjusting the blending ratio to seamlessly integrate auxiliary features. Additionally, we introduce a feedback-driven mechanism that updates the auxiliary latents in reverse order, facilitating global blending and preventing rigid or unnatural outputs. Extensive experiments demonstrate that our method significantly improves both the semantic coherence and visual quality of blended images, yielding compelling and coherent results.
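The stepwise increasing interpolation and the feedback update described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the linear schedule, the `start` and `end` ratios, and the choice to make the feedback ratio the complement of the blend ratio are all assumptions made for illustration, and latents are modeled as plain lists of floats with the denoising step omitted.

```python
from typing import List, Tuple

Latent = List[float]


def blend_ratio(step: int, total_steps: int,
                start: float = 0.1, end: float = 0.9) -> float:
    """Hypothetical linear schedule: the ratio grows from start to end."""
    if total_steps <= 1:
        return end
    return start + (end - start) * step / (total_steps - 1)


def lerp(a: Latent, b: Latent, ratio: float) -> Latent:
    """Element-wise (1 - ratio) * a + ratio * b."""
    return [(1.0 - ratio) * x + ratio * y for x, y in zip(a, b)]


def blend_with_feedback(blend: Latent, aux_a: Latent, aux_b: Latent,
                        total_steps: int = 5) -> Tuple[Latent, Latent, Latent]:
    """Toy loop: pull the blending latent toward the auxiliary latents with an
    increasing ratio, then feed the result back into the auxiliary latents."""
    for step in range(total_steps):
        r = blend_ratio(step, total_steps)
        # A real diffusion pipeline would run one denoising step on each
        # latent here; that step is omitted in this sketch.
        aux_mean = lerp(aux_a, aux_b, 0.5)
        blend = lerp(blend, aux_mean, r)
        # Assumption: the feedback ratio runs opposite to the blend ratio,
        # so feedback is strongest in early steps and weakest at the end.
        feedback = 1.0 - r
        aux_a = lerp(aux_a, blend, feedback)
        aux_b = lerp(aux_b, blend, feedback)
    return blend, aux_a, aux_b
```

Calling `blend_with_feedback([0.5, 0.5], [0.0, 0.0], [1.0, 1.0])` returns the final blending latent together with the two updated auxiliary latents. Every update is a convex combination, so all values stay within the range spanned by the inputs.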