UM-Text: A Unified Multimodal Model for Image Understanding and Visual Text Editing

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the misalignment between natural-language instructions and reference-image styles in visual text editing by proposing UM-Text, a unified multimodal model that removes the manual specification of font, color, and layout required by conventional approaches. Built on a vision-language foundation, UM-Text uses a UM-Encoder to automatically fuse multimodal conditional embeddings, incorporates a regional consistency loss, and follows a three-stage training strategy. The method is supported by UM-DATA-200K, a large-scale dataset curated for this task, enabling end-to-end style-aware text generation. Experimental results demonstrate that UM-Text achieves state-of-the-art performance across multiple public benchmarks, significantly outperforming existing methods in both content accuracy and stylistic harmony.
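The summary notes that the UM-Encoder fuses multimodal conditional embeddings and that the combination is configured automatically from the instruction. The page does not spell out the fusion mechanism, so the snippet below is only a minimal sketch of one plausible gated-fusion design; the module names, dimensions, number of condition streams, and softmax gating are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: fuse several conditional embeddings (e.g. text content,
# glyph, style, layout) with mixing weights predicted from the instruction
# embedding. All names and shapes are assumptions for illustration.
import torch
import torch.nn as nn


class ConditionFusion(nn.Module):
    def __init__(self, dim, num_conditions=4):
        super().__init__()
        # One projection per condition stream.
        self.proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_conditions))
        # Predicts one mixing weight per condition from the instruction embedding.
        self.gate = nn.Linear(dim, num_conditions)

    def forward(self, instruction_emb, condition_embs):
        # instruction_emb: (B, dim); each condition embedding: (B, L, dim)
        weights = torch.softmax(self.gate(instruction_emb), dim=-1)  # (B, num_conditions)
        fused = sum(
            weights[:, i:i + 1, None] * proj(cond)
            for i, (proj, cond) in enumerate(zip(self.proj, condition_embs))
        )
        return fused  # (B, L, dim), passed to the generator as conditioning
```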

📝 Abstract
With the rapid advancement of image generation, visual text editing using natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and reference image, and thus generate visual text that is style-consistent with the image. Previous methods often involve complex steps of specifying the text content and attributes, such as font size, color, and layout, without considering stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing through natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and reference image, so that the text content and layout can be carefully designed according to the contextual information. To generate accurate and harmonious visual text images, we further propose the UM-Encoder to combine the embeddings of various conditioning information, where the combination is automatically configured by the VLM according to the input instruction. During training, we propose a regional consistency loss to provide more effective supervision for glyph generation in both latent and RGB space, and design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute UM-DATA-200K, a large-scale visual text image dataset covering diverse scenes for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.
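The abstract states that the regional consistency loss supervises glyph generation in both latent and RGB space, but its exact formulation is not given on this page. The sketch below illustrates one way such a region-weighted loss could look under a latent-diffusion assumption; the mask handling, decoder call, and weighting scheme are illustrative choices rather than the paper's actual loss.

```python
# Hedged sketch of a region-weighted consistency loss combining a latent-space
# term and an RGB-space term restricted to the text region. All details
# (mask resolution, weighting, L1 vs. MSE) are assumptions for illustration.
import torch
import torch.nn.functional as F


def regional_consistency_loss(pred_latent, target_latent, vae_decoder,
                              text_mask_rgb, lambda_rgb=0.5):
    # Latent-space term, up-weighted inside the text region; the RGB-resolution
    # mask (B, 1, H, W) is downsampled to the latent resolution.
    mask_latent = F.interpolate(text_mask_rgb, size=pred_latent.shape[-2:], mode="nearest")
    latent_term = (F.mse_loss(pred_latent, target_latent, reduction="none")
                   * (1.0 + mask_latent)).mean()

    # RGB-space term: decode both latents and compare only inside the text
    # region, which supervises glyph shapes more directly than latent MSE alone.
    pred_rgb = vae_decoder(pred_latent)
    target_rgb = vae_decoder(target_latent)
    rgb_term = (F.l1_loss(pred_rgb, target_rgb, reduction="none") * text_mask_rgb).mean()

    return latent_term + lambda_rgb * rgb_term
```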
Problem

Research questions and friction points this paper is trying to address.

visual text editing
style consistency
natural language instruction
image understanding
multimodal modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Multimodal Model
Visual Language Model
Style-consistent Text Generation
Regional Consistency Loss
Natural Language Instruction
Authors
Lichen Ma (JD.COM)
Xiaolong Fu (JD.COM)
Gaojing Zhou (JD.COM)
Zipeng Guo (JD.COM, Sun Yat-sen University)
Ting Zhu (JD.COM)
Yichun Liu (JD.COM)
Yu Shi (JD.COM)
Jason Li (JD.COM)
Junshi Huang (Meituan)
Computer Vision · NLP · Machine Learning