OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-image manipulation (TIM) methods suffer from three key limitations: incomplete text removal, weak stylistic control, and spurious character repetition. To address these, we propose OmniText, the first training-free, general-purpose framework supporting diverse editing tasks, including text removal, rescaling, repositioning, insertion, and style-controlled editing. Its core innovations are: (1) a self-attention inversion and cross-attention redistribution mechanism that suppresses text hallucination in diffusion-based inpainting; and (2) a latent optimization framework with two novel losses, a cross-attention content loss for text rendering accuracy and a self-attention style loss for style customization. Extensive evaluations demonstrate that OmniText outperforms existing text inpainting methods and matches the performance of task-specific models across multiple benchmarks. To foster standardized evaluation, we concurrently release OmniText-Bench, a comprehensive benchmark for TIM covering diverse editing scenarios with input images, target text, masks, and style references. This work establishes a new foundation for robust, controllable, and generalizable text-aware image editing.

📝 Abstract
Recent advancements in diffusion-based text synthesis have demonstrated significant performance in inserting and editing text within images via inpainting. However, despite the potential of text inpainting methods, three key limitations hinder their applicability to broader Text Image Manipulation (TIM) tasks: (i) the inability to remove text, (ii) the lack of control over the style of rendered text, and (iii) a tendency to generate duplicated letters. To address these challenges, we propose OmniText, a training-free generalist capable of performing a wide range of TIM tasks. Specifically, we investigate two key properties of cross- and self-attention mechanisms to enable text removal and to provide control over both text styles and content. Our findings reveal that text removal can be achieved by applying self-attention inversion, which mitigates the model's tendency to focus on surrounding text, thus reducing text hallucinations. Additionally, we redistribute cross-attention, as increasing the probability of certain text tokens reduces text hallucination. For controllable inpainting, we introduce novel loss functions in a latent optimization framework: a cross-attention content loss to improve text rendering accuracy and a self-attention style loss to facilitate style customization. Furthermore, we present OmniText-Bench, a benchmark dataset for evaluating diverse TIM tasks. It includes input images, target text with masks, and style references, covering diverse applications such as text removal, rescaling, repositioning, and insertion and editing with various styles. Our OmniText framework is the first generalist method capable of performing diverse TIM tasks. It achieves state-of-the-art performance across multiple tasks and metrics compared to other text inpainting methods and is comparable with specialist methods.
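The controllable-inpainting objective described in the abstract pairs a cross-attention content loss with a self-attention style loss inside a latent optimization loop. The paper's exact formulation is not given here, so the following is a minimal numpy sketch under stated assumptions: the content loss is taken as the negative log of the cross-attention probability mass that masked pixels assign to the target text tokens, and the style loss as a Gram-matrix distance between self-attention features of the edited region and a style reference. All function names, shapes, and the Gram-matrix choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over attention logits.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def content_loss(cross_attn_probs, target_token_ids):
    """Cross-attention content loss (assumed form): negative log of the
    probability mass that pixels in the edit mask assign to target text
    tokens. cross_attn_probs has shape (num_pixels, num_tokens)."""
    p_target = cross_attn_probs[:, target_token_ids].sum(axis=-1)
    return float(-np.log(p_target + 1e-8).mean())

def gram(feats):
    # (num_pixels, dim) -> (dim, dim) feature-correlation matrix.
    return feats.T @ feats / feats.shape[0]

def style_loss(self_attn_feats, ref_feats):
    """Self-attention style loss (assumed form): squared Frobenius distance
    between Gram matrices of edited-region and reference-region features."""
    return float(np.sum((gram(self_attn_feats) - gram(ref_feats)) ** 2))

def total_loss(cross_attn_probs, target_token_ids, feats, ref_feats, lam=0.5):
    # Combined objective minimized over the latent; lam is a hypothetical
    # weighting hyperparameter.
    return content_loss(cross_attn_probs, target_token_ids) + lam * style_loss(feats, ref_feats)
```

In a real latent-optimization setting these losses would be computed from the diffusion model's attention maps and backpropagated to the latent; here they only illustrate the structure of the dual objective.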
Problem

Research questions and friction points this paper is trying to address.

Removing text from images without training
Controlling text style and content in images
Preventing duplicated letters in text generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free generalist for text-image manipulation tasks
Self-attention inversion enables text removal and reduces hallucinations
Cross-attention redistribution with novel loss functions controls style
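The self-attention inversion idea, as described, steers masked pixels away from attending to surrounding text so the inpainted region reproduces background rather than hallucinated letters. A minimal numpy sketch of one plausible realization, assuming access to raw self-attention logits and a binary mask over keys that lie on nearby text; the additive-bias form and the `strength` parameter are assumptions, not the paper's method:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over attention logits.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def invert_self_attention(attn_logits, text_key_mask, strength=4.0):
    """Down-weight attention from masked query pixels toward keys on
    surrounding text (text_key_mask == 1), renormalizing so attention
    shifts to background keys instead.
    Shapes: attn_logits (num_queries, num_keys), text_key_mask (num_keys,)."""
    biased = attn_logits - strength * text_key_mask[None, :]
    return softmax(biased, axis=-1)
```

After the bias is applied, attention mass on text-region keys shrinks and the remainder redistributes to background keys, which is the behavior the inversion mechanism is meant to achieve.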