🤖 AI Summary
In exemplar-based semantic image synthesis for complex multi-object scenes, preserving appearance while enabling geometric deformation remains challenging. To address this, we propose the learnable Appearance Matching Adapter (AM-Adapter), the first method to incorporate semantic segmentation maps into cross-image appearance matching. By enhancing cross-image feature alignment within the self-attention modules of diffusion models, AM-Adapter enables automatic multi-object appearance transfer and user-controllable, fine-grained detail mapping. We adopt a stage-wise training strategy to decouple the generation and matching tasks, and we design a lightweight, automated exemplar retrieval mechanism. With only a small number of learnable parameters, our approach substantially improves semantic alignment and local appearance fidelity, especially in demanding domains such as autonomous driving, achieving state-of-the-art performance. Ablation studies validate the effectiveness of each component.
📝 Abstract
Exemplar-based semantic image synthesis aims to generate images aligned with given semantic content while preserving the appearance of an exemplar image. Conventional structure-guidance models, such as ControlNet, are limited in that they cannot directly utilize exemplar images as input, relying instead solely on text prompts to control appearance. Recent tuning-free approaches address this limitation by transferring local appearance from the exemplar image to the synthesized image through implicit cross-image matching in the augmented self-attention mechanism of pre-trained diffusion models. However, these methods face challenges when applied to content-rich scenes with significant geometric deformations, such as driving scenes. In this paper, we propose the Appearance Matching Adapter (AM-Adapter), a learnable framework that enhances cross-image matching within augmented self-attention by incorporating semantic information from segmentation maps. To effectively disentangle the generation and matching processes, we adopt a stage-wise training approach: we first train the structure-guidance and generation networks, then train the AM-Adapter while keeping the other networks frozen. During inference, we introduce an automated exemplar retrieval method to efficiently select exemplar image-segmentation pairs. Despite utilizing a limited number of learnable parameters, our method achieves state-of-the-art performance, excelling in both semantic alignment preservation and local appearance fidelity. Extensive ablation studies further validate our design choices. Code and pre-trained weights will be publicly available at https://cvlab-kaist.github.io/AM-Adapter/
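To make the core idea concrete, here is a minimal NumPy sketch of augmented self-attention with a segmentation-guided matching bias. This is an illustration only, not the paper's implementation: the AM-Adapter learns its matching term, whereas this sketch substitutes a hard label-match bias (`bias_weight`); all function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def augmented_self_attention(q_tgt, k_tgt, v_tgt, k_ex, v_ex,
                             seg_tgt, seg_ex, bias_weight=2.0):
    """Target queries attend jointly over target and exemplar tokens.

    Exemplar attention logits are biased upward where the target and
    exemplar tokens share a segmentation label, a crude stand-in for
    the learned semantic matching of the adapter.
    """
    d = q_tgt.shape[-1]
    keys = np.concatenate([k_tgt, k_ex], axis=0)       # (Nt + Ne, d)
    vals = np.concatenate([v_tgt, v_ex], axis=0)       # (Nt + Ne, d)
    logits = q_tgt @ keys.T / np.sqrt(d)               # (Nt, Nt + Ne)
    # Semantic matching bias on the exemplar half of the logits:
    match = (seg_tgt[:, None] == seg_ex[None, :]).astype(float)  # (Nt, Ne)
    logits[:, k_tgt.shape[0]:] += bias_weight * match
    attn = softmax(logits, axis=-1)
    return attn @ vals                                  # (Nt, d)
```

With a large `bias_weight`, each target token's output collapses onto the exemplar values of the same semantic class, which is the appearance-transfer behavior the matching term is meant to encourage; the learned adapter replaces this hard rule with a trained, soft correspondence.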