🤖 AI Summary
To address global structural distortion, local texture degradation, and poor text-prompt alignment in ultra-high-resolution (4K+) image inpainting, this paper proposes Patch-Adapter, a dual-stage adapter framework that requires no modification of the pre-trained diffusion model. The method decouples global semantic coherence from local detail fidelity: Stage I employs dual-context adapters that learn coherence from downsampled features; Stage II introduces reference-image patch attention for adaptive, full-resolution, patch-level feature fusion. Evaluated on OpenImages and Photo-Concept-Bucket, Patch-Adapter achieves state-of-the-art performance, substantially suppressing large-area inpainting artifacts while improving perceptual quality and text-image alignment.
📝 Abstract
In this work, we present Patch-Adapter, an effective framework for high-resolution text-guided image inpainting. Unlike existing methods limited to lower resolutions, our approach operates at 4K+ resolution while maintaining precise content consistency and prompt alignment, two critical challenges in image inpainting that intensify with increasing resolution and texture complexity. Patch-Adapter leverages a two-stage adapter architecture to scale the diffusion model's resolution from 1K to 4K+ without requiring structural overhauls: (1) a Dual Context Adapter that learns coherence between masked and unmasked regions at reduced resolutions to establish global structural consistency; and (2) a Reference Patch Adapter that implements a patch-level attention mechanism for full-resolution inpainting, preserving local detail fidelity through adaptive feature fusion. This dual-stage architecture uniquely addresses the scalability gap in high-resolution inpainting by decoupling global semantics from localized refinement. Experiments demonstrate that Patch-Adapter not only resolves artifacts common in large-scale inpainting but also achieves state-of-the-art performance on the OpenImages and Photo-Concept-Bucket datasets, outperforming existing methods in both perceptual quality and text-prompt adherence.
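To make the Reference Patch Adapter idea concrete, the sketch below shows one plausible way to implement patch-level reference attention with adaptive fusion in PyTorch: full-resolution features are split into non-overlapping patches, each patch cross-attends to the corresponding patch of a reference feature map (e.g. the upsampled Stage-I result), and a learned gate blends the attended and original features. All names here (`PatchRefAttention`, `patch_size`, the sigmoid gate) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# A minimal sketch of patch-level reference attention, assuming a
# hypothetical module interface; not the authors' released code.
import torch
from torch import nn


class PatchRefAttention(nn.Module):
    """Cross-attends full-resolution patch features to reference-image
    patch features, then fuses the result back with a learned gate."""

    def __init__(self, dim: int, num_heads: int = 8, patch_size: int = 64):
        super().__init__()
        self.patch_size = patch_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(dim, 1)  # per-token adaptive fusion weight

    def _to_patches(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B * num_patches, patch_size**2, C);
        # H and W are assumed divisible by patch_size.
        b, c, h, w = x.shape
        p = self.patch_size
        x = x.unfold(2, p, p).unfold(3, p, p)        # B, C, H/p, W/p, p, p
        return x.permute(0, 2, 3, 4, 5, 1).reshape(-1, p * p, c)

    def forward(self, feats: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feats.shape
        q = self._to_patches(feats)   # queries: full-resolution patches
        kv = self._to_patches(ref)    # keys/values: reference patches
        attended, _ = self.attn(q, kv, kv)
        # Adaptive fusion: a sigmoid gate blends attended and original tokens.
        g = torch.sigmoid(self.gate(attended))
        out = g * attended + (1 - g) * q
        # Fold patch tokens back to a (B, C, H, W) feature map.
        p = self.patch_size
        out = out.reshape(b, h // p, w // p, p, p, c)
        return out.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)


# Usage: fuse 4K-scale features with an upsampled low-resolution reference.
module = PatchRefAttention(dim=320, patch_size=64)
feats = torch.randn(1, 320, 256, 256)   # full-resolution latent features
ref = torch.randn(1, 320, 256, 256)     # e.g. upsampled Stage-I features
fused = module(feats, ref)              # (1, 320, 256, 256)
```

Restricting attention to per-patch windows keeps memory linear in image area rather than quadratic, which is what makes an attention-based fusion plausible at 4K+ resolutions.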