🤖 AI Summary
Cross-category object fusion often suffers from visual artifacts (e.g., overlapping ghost objects) and semantic fragmentation, which make results visually incoherent and semantically inconsistent. Progress is further limited by the lack of a comprehensive benchmark. To address this, the paper proposes Adaptive Group Swapping (AGSwap), which has two components: Group-wise Embedding Swapping, which exchanges groups of embeddings between concepts in feature space, and Adaptive Group Updating, which optimizes the swap under the guidance of a balance evaluation score. The paper also introduces COF, a large-scale, hierarchically structured cross-category fusion dataset built on ImageNet-1K and WordNet, with 95 superclasses of 10 subclasses each and 451,250 unique fusion pairs. Experiments show that AGSwap outperforms state-of-the-art compositional T2I methods, including GPT-Image-1, under both simple and complex text prompts.
📝 Abstract
Fusing cross-category objects into a single coherent object has gained increasing attention in text-to-image (T2I) generation due to its broad applications in virtual reality, digital media, film, and gaming. However, existing methods often produce biased, visually chaotic, or semantically inconsistent results due to overlapping artifacts and poor integration. Moreover, progress in this field has been limited by the absence of a comprehensive benchmark dataset. To address these problems, we propose Adaptive Group Swapping (AGSwap), a simple yet highly effective approach comprising two key components: (1) Group-wise Embedding Swapping, which fuses semantic attributes from different concepts through feature manipulation, and (2) Adaptive Group Updating, a dynamic optimization mechanism guided by a balance evaluation score to ensure coherent synthesis. Additionally, we introduce Cross-category Object Fusion (COF), a large-scale, hierarchically structured dataset built upon ImageNet-1K and WordNet. COF includes 95 superclasses, each with 10 subclasses, enabling 451,250 unique fusion pairs. Extensive experiments demonstrate that AGSwap outperforms state-of-the-art compositional T2I methods, including GPT-Image-1, under both simple and complex prompts.
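The abstract does not give implementation details, so the following is only a toy sketch of the general idea behind the two components, not the authors' method. It assumes each concept is a flat list of floats, that "groups" are contiguous chunks of dimensions, that every second group is swapped, and that the balance score is the gap between cosine similarities to the two sources. The function names `groupwise_swap`, `balance_gap`, and `adaptive_group_swap` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def groupwise_swap(emb_a: Sequence[float], emb_b: Sequence[float],
                   group_size: int) -> List[float]:
    """Copy emb_a, then overwrite every second contiguous group with emb_b.

    Assumption: contiguous groups and an alternating pattern are illustrative
    choices only; the source does not specify how groups are formed.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    if group_size < 1:
        raise ValueError("group_size must be positive")
    fused = list(emb_a)
    for start in range(0, len(fused), group_size):
        if (start // group_size) % 2 == 1:  # odd-indexed groups come from emb_b
            fused[start:start + group_size] = emb_b[start:start + group_size]
    return fused


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def balance_gap(fused: Sequence[float], emb_a: Sequence[float],
                emb_b: Sequence[float]) -> float:
    """Hypothetical balance score: 0 means equally similar to both sources."""
    return abs(cosine(fused, emb_a) - cosine(fused, emb_b))


def adaptive_group_swap(emb_a: Sequence[float], emb_b: Sequence[float],
                        candidates: Sequence[int] = (1, 2, 4, 8)
                        ) -> Tuple[int, List[float]]:
    """Try each candidate group size and keep the first one with the smallest gap."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    best_size, best_fused, best_gap = candidates[0], list(emb_a), math.inf
    for size in candidates:
        fused = groupwise_swap(emb_a, emb_b, size)
        gap = balance_gap(fused, emb_a, emb_b)
        if gap < best_gap:
            best_size, best_fused, best_gap = size, fused, gap
    return best_size, best_fused
```

In this sketch the search over group sizes plays the role of the adaptive update: a group size that leaves the fused embedding dominated by one concept scores a large gap and is rejected in favor of a more balanced split. A real T2I pipeline would presumably evaluate balance on generated images or text-encoder features rather than on raw vectors.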