AGSwap: Overcoming Category Boundaries in Object Fusion via Adaptive Group Swapping

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cross-category object fusion often suffers from visual artifacts (e.g., overlapping ghosts) and semantic fragmentation, which make results visually incoherent and semantically inconsistent. The lack of a comprehensive benchmark dataset compounds the problem. To address this, the paper proposes Adaptive Group Swapping (AGSwap), a method that exchanges groups of embedding features between two concepts (Group-wise Embedding Swapping) and updates those groups dynamically under the guidance of a balance evaluation score (Adaptive Group Updating). It also introduces COF, a large-scale, hierarchically structured cross-category fusion dataset built on ImageNet-1K and WordNet, with 95 superclasses of 10 subclasses each and 451,250 unique fusion pairs. Experiments show that AGSwap outperforms state-of-the-art compositional T2I methods, including GPT-Image-1, under both simple and complex text prompts.

📝 Abstract
Fusing cross-category objects into a single coherent object has gained increasing attention in text-to-image (T2I) generation due to its broad applications in virtual reality, digital media, film, and gaming. However, existing methods often produce biased, visually chaotic, or semantically inconsistent results due to overlapping artifacts and poor integration. Moreover, progress in this field has been limited by the absence of a comprehensive benchmark dataset. To address these problems, we propose Adaptive Group Swapping (AGSwap), a simple yet highly effective approach comprising two key components: (1) Group-wise Embedding Swapping, which fuses semantic attributes from different concepts through feature manipulation, and (2) Adaptive Group Updating, a dynamic optimization mechanism guided by a balance evaluation score to ensure coherent synthesis. Additionally, we introduce Cross-category Object Fusion (COF), a large-scale, hierarchically structured dataset built upon ImageNet-1K and WordNet. COF includes 95 superclasses, each with 10 subclasses, enabling 451,250 unique fusion pairs. Extensive experiments demonstrate that AGSwap outperforms state-of-the-art compositional T2I methods, including GPT-Image-1, using both simple and complex prompts.
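The two components named in the abstract can be pictured with a small sketch. What follows is a minimal, hypothetical illustration in plain Python, not the authors' implementation. It assumes that embeddings are flat lists of floats, that groups are contiguous channel ranges, that `balance_score` is a stand-in rewarding equal cosine similarity to both source embeddings, and that a simple greedy flip search can substitute for whatever update rule the paper actually uses. Every function name and parameter here is an assumption made for illustration.

```python
import math
import random


def split_groups(dim, num_groups):
    """Partition `dim` feature channels into contiguous (start, end) ranges."""
    size = dim // num_groups
    bounds = [(g * size, (g + 1) * size) for g in range(num_groups)]
    # The last group absorbs any remainder channels.
    bounds[-1] = (bounds[-1][0], dim)
    return bounds


def swap_groups(emb_a, emb_b, mask, bounds):
    """Group-wise swap: take group g from emb_b where mask[g] is True, else from emb_a."""
    fused = list(emb_a)
    for take_b, (lo, hi) in zip(mask, bounds):
        if take_b:
            fused[lo:hi] = emb_b[lo:hi]
    return fused


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def balance_score(fused, emb_a, emb_b):
    """Hypothetical balance score: highest (0) when fused is equally close to both sources."""
    return -abs(cosine(fused, emb_a) - cosine(fused, emb_b))


def adaptive_update(emb_a, emb_b, num_groups=4, steps=50, seed=0):
    """Greedy stand-in for adaptive group updating: flip one group at a time,
    keeping a flip only if it strictly improves the balance score."""
    rng = random.Random(seed)
    bounds = split_groups(len(emb_a), num_groups)
    mask = [g % 2 == 1 for g in range(num_groups)]  # start from an alternating swap
    best = balance_score(swap_groups(emb_a, emb_b, mask, bounds), emb_a, emb_b)
    for _ in range(steps):
        candidate = list(mask)
        g = rng.randrange(num_groups)
        candidate[g] = not candidate[g]
        score = balance_score(swap_groups(emb_a, emb_b, candidate, bounds), emb_a, emb_b)
        if score > best:
            mask, best = candidate, score
    return mask, swap_groups(emb_a, emb_b, mask, bounds)
```

In a real T2I pipeline the fused embedding would condition an image generator and the score would presumably be computed on the generated output; the sketch only shows the swap-then-rescore loop on raw vectors.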
Problem

Research questions and friction points this paper is trying to address.

Fusing cross-category objects into coherent single objects
Overcoming biased and semantically inconsistent fusion results
Addressing the lack of comprehensive benchmark datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Group Swapping for semantic fusion
Dynamic optimization with balance evaluation score
Large-scale Cross-category Object Fusion dataset
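The dataset size quoted in the abstract can be sanity-checked with quick arithmetic. The snippet below is only a check of the numbers: 95 superclasses with 10 subclasses each give 950 classes, and the stated 451,250 equals 950 squared divided by two. The count of unordered pairs of distinct classes, C(950, 2), is slightly smaller. Which counting convention the authors use is not stated on this page, so the match is an observation rather than a confirmed derivation.

```python
from math import comb

superclasses = 95
subclasses_per_superclass = 10
classes = superclasses * subclasses_per_superclass  # 950 classes in total

stated_pairs = 451_250                # figure quoted in the abstract
half_of_square = classes * classes // 2  # 950 * 950 / 2
unordered_pairs = comb(classes, 2)       # pairs of two distinct classes

print(classes, half_of_square, unordered_pairs)
```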
Zedong Zhang
Nanjing University of Science and Technology, China
Ying Tai
Nanjing University, China
Jianjun Qian
Nanjing University of Science and Technology
Pattern Recognition, Computer Vision, Face Recognition
Jian Yang
Nanjing University of Science and Technology, China
Jun Li
Nanjing University of Science and Technology, China