🤖 AI Summary
Existing reference-image-guided diffusion models suffer fine-grained texture degradation because VAE latent-space compression discards subtle detail, weakening identity and attribute cues; subsequent local post-editing then often introduces inconsistencies in illumination, texture, or shape. Method: a two-stage refinement framework that first preserves global structural fidelity by fine-tuning a single-image diffusion editor on paired draft and reference images, then applies reinforcement learning-driven local diffusion editing that jointly optimizes fine-grained texture restoration and semantic consistency, thereby working around the VAE compression bottleneck. A custom reward function guides detail accuracy and contextual coherence (see the sketch below). Contribution/Results: extensive evaluation demonstrates significant improvements over both open-source and commercial baselines across multiple benchmarks, achieving state-of-the-art performance in reference alignment, texture fidelity, and local consistency.
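The reward is only described as guiding detail accuracy and contextual coherence, so the sketch below is an assumption about its shape rather than the authors' formulation: a weighted sum of a local texture-fidelity term (agreement with the reference inside the edited region) and a global semantic-consistency term (agreement with the original draft). All names here (`composite_reward`, `embed_fn`, the weights) are hypothetical.

```python
import torch
import torch.nn.functional as F

def composite_reward(edited, reference, draft, mask, embed_fn,
                     texture_weight=1.0, semantic_weight=0.5):
    """Hypothetical composite reward for RL-driven local editing.

    edited, reference, draft: float tensors of shape (B, C, H, W) in [0, 1].
    mask: (B, 1, H, W) binary mask marking the locally edited region.
    embed_fn: any frozen image encoder returning (B, D) embeddings
              (e.g. a CLIP vision tower; an assumption, not from the paper).
    """
    # Texture fidelity: negative masked L1 distance to the reference
    # inside the edited region (higher is better).
    area = mask.sum(dim=(1, 2, 3)).clamp(min=1.0)
    texture = -(mask * (edited - reference).abs()).sum(dim=(1, 2, 3)) / area

    # Semantic consistency: cosine similarity between whole-image embeddings
    # of the edit and the original draft, so lighting and shape stay coherent.
    e_edit = F.normalize(embed_fn(edited), dim=-1)
    e_draft = F.normalize(embed_fn(draft), dim=-1)
    semantic = (e_edit * e_draft).sum(dim=-1)

    return texture_weight * texture + semantic_weight * semantic
```

Any policy-gradient method that accepts a per-sample scalar reward could consume this score; the paper does not state which RL algorithm is used.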
📝 Abstract
Reference-guided image generation has progressed rapidly, yet current diffusion models still struggle to preserve fine-grained visual details when refining a generated image against a reference. This limitation arises because VAE-based latent compression inherently discards subtle texture information, causing identity- and attribute-specific cues to vanish. Moreover, post-editing approaches built on existing methods to amplify local details often produce results that are inconsistent with the original image in lighting, texture, or shape. To address this, we introduce our method, a detail-aware refinement framework that performs two consecutive stages of reference-driven correction to enhance pixel-level consistency. We first adapt a single-image diffusion editor, fine-tuning it to jointly ingest the draft image and the reference image, which enables globally coherent refinement while maintaining structural fidelity. We then apply reinforcement learning to further strengthen localized editing, explicitly optimizing for detail accuracy and semantic consistency. Extensive experiments demonstrate that our method significantly improves reference alignment and fine-grained detail preservation, producing faithful and visually coherent edits that surpass both open-source and commercial models on challenging reference-guided restoration benchmarks.
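The abstract does not specify how the editor "jointly ingests" the draft and the reference. One common mechanism in single-image diffusion editors, popularized by InstructPix2Pix, is channel-wise concatenation of conditioning latents with the noisy latent, with the UNet's input convolution widened and zero-initialized; the sketch below assumes that design and should not be read as the authors' implementation.

```python
# Assumed conditioning mechanism (not stated in the abstract): channel-wise
# concatenation of VAE latents, in the style of InstructPix2Pix editors.
import torch
import torch.nn as nn

def widen_input_conv(conv: nn.Conv2d, extra_channels: int) -> nn.Conv2d:
    """Return a copy of the UNet's first conv that accepts extra conditioning
    channels. New weights are zero-initialized so the pretrained editor's
    behavior is unchanged at the start of fine-tuning. Assumes conv has a bias,
    as in Stable Diffusion's conv_in."""
    new = nn.Conv2d(conv.in_channels + extra_channels, conv.out_channels,
                    conv.kernel_size, conv.stride, conv.padding)
    with torch.no_grad():
        new.weight.zero_()
        new.weight[:, :conv.in_channels] = conv.weight
        new.bias.copy_(conv.bias)
    return new

def build_editor_input(noisy_latent, draft_latent, ref_latent):
    """Stack the noisy latent with draft and reference latents (each
    (B, 4, h, w) from the shared VAE) into a (B, 12, h, w) UNet input."""
    return torch.cat([noisy_latent, draft_latent, ref_latent], dim=1)
```

Zero-initializing the added input channels is the standard trick for this adaptation: at step zero the widened model reproduces the original editor exactly, and the draft/reference conditioning is learned gradually during fine-tuning.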