Token Painter: Training-Free Text-Guided Image Inpainting via Mask Autoregressive Models

📅 2025-09-28

📈 Citations: 0

✨ Influential: 0

career value

179K/year

🤖 AI Summary

In text-guided image inpainting, diffusion models struggle with fine-grained text-image alignment, while mask autoregressive (MAR) models often neglect textual prompts or compromise background consistency. To address these challenges, this paper proposes a novel training-free framework. Methodologically, it introduces a dual-stream encoder for spatial-frequency feature fusion, coupled with semantic-guided token generation and adaptive attention score enhancement—jointly improving text-image alignment accuracy and background coherence. Notably, this work presents the first successful application of MAR models to zero-shot text-guided inpainting. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art methods across key metrics—including text alignment fidelity, detail preservation, and background consistency—yielding visually superior inpainted results.

Technology Category

Application Category

📝 Abstract

Text-guided image inpainting aims to inpaint masked image regions based on a textual prompt while preserving the background. Although diffusion-based methods have become dominant, their property of modeling the entire image in latent space makes it challenging for the results to align well with prompt details and maintain a consistent background. To address these issues, we explore Mask AutoRegressive (MAR) models for this task. MAR naturally supports image inpainting by generating latent tokens corresponding to mask regions, enabling better local controllability without altering the background. However, directly applying MAR to this task makes the inpainting content either ignore the prompts or be disharmonious with the background context. Through analysis of the attention maps from the inpainting images, we identify the impact of background tokens on text tokens during the MAR generation, and leverage this to design extbf{Token Painter}, a training-free text-guided image inpainting method based on MAR. Our approach introduces two key components: (1) Dual-Stream Encoder Information Fusion (DEIF), which fuses the semantic and context information from text and background in frequency domain to produce novel guidance tokens, allowing MAR to generate text-faithful inpainting content while keeping harmonious with background context. (2) Adaptive Decoder Attention Score Enhancing (ADAE), which adaptively enhances attention scores on guidance tokens and inpainting tokens to further enhance the alignment of prompt details and the content visual quality. Extensive experiments demonstrate that our training-free method outperforms prior state-of-the-art methods across almost all metrics and delivers superior visual results. Codes will be released.

Problem

Research questions and friction points this paper is trying to address.

Achieving text-aligned inpainting while preserving background consistency

Addressing background interference in autoregressive image inpainting models

Enhancing prompt detail alignment without requiring model retraining

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Mask AutoRegressive models for image inpainting

Introduces Dual-Stream Encoder Information Fusion

Implements Adaptive Decoder Attention Score Enhancing

🔎 Similar Papers

No similar papers found.