A Training-Free Framework for High-Fidelity Appearance Transfer via Diffusion Transformers

📅 2026-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that existing Diffusion Transformers (DiTs) often disrupt scene structure because of their global self-attention, which hinders high-fidelity, controllable reference-based appearance transfer without retraining. The authors propose the first training-free DiT framework for appearance transfer, achieving precise fine-grained texture migration while preserving geometric structure through three key components: disentanglement of structure and appearance features, high-fidelity inverse mapping, and a geometry-prior-guided dynamic attention-sharing mechanism. Evaluated at 1024px resolution, the method attains state-of-the-art performance, outperforming specialized models on both semantic attribute transfer and material transfer, and significantly improves structural consistency and appearance fidelity.
📝 Abstract
Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike U-Nets, naively injecting local appearance into a DiT can disrupt its holistic scene structure. We address this by proposing the first training-free framework specifically designed to tame DiTs for high-fidelity appearance transfer. Our core is a synergistic system that disentangles structure and appearance. We leverage high-fidelity inversion to establish a rich content prior for the source image, capturing its lighting and micro-textures. A novel attention-sharing mechanism then dynamically fuses purified appearance features from a reference, guided by geometric priors. Our unified approach operates at 1024px and outperforms specialized methods on tasks ranging from semantic attribute transfer to fine-grained material application. Extensive experiments confirm our state-of-the-art performance in both structural preservation and appearance fidelity.
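The attention-sharing mechanism described in the abstract can be illustrated with a minimal conceptual sketch: queries from the structure (source) branch attend to keys and values from the appearance (reference) branch, with a geometric prior biasing the attention logits toward spatially corresponding tokens. All function names, the log-space bias formulation, and the `alpha` weighting below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_attention(q_struct, k_ref, v_ref, geo_prior, alpha=1.0):
    """Conceptual cross-branch attention sharing (hypothetical sketch).

    q_struct:  (N, d) queries from the structure/source branch
    k_ref:     (M, d) keys from the appearance/reference branch
    v_ref:     (M, d) values from the appearance/reference branch
    geo_prior: (N, M) nonnegative geometric correspondence weights
    alpha:     strength of the geometry-guided bias (assumed knob)
    """
    d = q_struct.shape[-1]
    logits = q_struct @ k_ref.T / np.sqrt(d)
    # Bias logits toward geometrically corresponding reference tokens.
    logits = logits + alpha * np.log(geo_prior + 1e-8)
    attn = softmax(logits, axis=-1)
    return attn @ v_ref
```

In this sketch, setting `geo_prior` to a uniform matrix recovers plain cross-attention, while a sharply peaked prior restricts each structure token to its geometrically matched reference region.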
Problem

Research questions and friction points this paper is trying to address.

appearance transfer
Diffusion Transformers
training-free
structural preservation
high-fidelity generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
diffusion transformers
appearance transfer
attention-sharing mechanism
structure-appearance disentanglement
Shengrong Gu
School of Intelligence Science and Technology, Nanjing University, Suzhou, China
Ye Wang
Jilin University
Computer Vision
Song Wu
Southwest University
Computer Vision, Machine Learning, Deep Learning, Multimedia
Rui Ma
Associate Professor at Jilin University
Computer Graphics, Computer Vision, Geometry Modeling, Shape Analysis, Content Creation
Qian Wang
JIUTIAN Research, Beijing, China
Lanjun Wang
School of New Media and Communication, Tianjin University, Tianjin, China
Zili Yi
School of Intelligence Science and Technology, Nanjing University, Suzhou, China; State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, China