🤖 AI Summary
Existing region-controllable diffusion methods are slow (52 seconds for a 512×512 image) and incompatible with acceleration techniques such as Latent Consistency Models (LCM), which hinders interactive creative applications. SemanticDraw addresses both problems in a single framework: it makes region-based semantic control compatible with diffusion acceleration schedulers such as LCM, preserving multi-prompt fidelity with 10× fewer inference steps, and it introduces a multi-prompt stream batch pipeline that raises generation throughput. Its multi-prompt streaming engine combines region masks, prompt embedding alignment, and latent-space consistency modeling to preserve both multi-prompt expressivity and region-level semantic fidelity. On a single RTX 2080 Ti, SemanticDraw generates 512×512 images in 0.64 seconds, compared with 52 seconds for prior region-controllable methods, bringing region-controlled generation to sub-second latency. This makes real-time, interactive content generation with precise spatial and semantic control considerably more practical.
📝 Abstract
We introduce SemanticDraw, a new paradigm of interactive content creation where high-quality images are generated in near real-time from multiple hand-drawn regions, each encoding a prescribed semantic meaning. To maximize the productivity of content creators and fully realize their artistic imagination, their tools need both quick interactive interfaces and fine-grained regional controls. Despite the astonishing generation quality of recent diffusion models, we find that existing approaches to regional controllability are very slow (52 seconds for a $512 \times 512$ image) and not compatible with acceleration methods such as LCM, which blocks their huge potential in interactive content creation. From this observation, we build our solution for interactive content creation in two steps: (1) we establish compatibility between region-based controls and acceleration techniques for diffusion models, maintaining high fidelity of multi-prompt image generation with $\times 10$ fewer inference steps, and (2) we increase the generation throughput with our new multi-prompt stream batch pipeline, enabling low-latency generation from multiple region-based text prompts on a single RTX 2080 Ti GPU. Our proposed framework generalizes to any existing diffusion model and acceleration scheduler, enabling sub-second (0.64 seconds) image content creation applications built on well-established image diffusion models. Our project page is: https://jaerinlee.com/research/semantic-draw
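The abstract does not spell out how the per-region prompts are combined into one image. As a rough illustration only, one common approach in region-based diffusion is to denoise a latent for each prompt and then average those latents under their region masks at every step. The sketch below shows that aggregation on a toy grid of scalars. The function name `blend_regions`, the list-of-lists layout, and the zero fill for uncovered pixels are assumptions made for this example, not the SemanticDraw implementation or API.

```python
from typing import List

Grid = List[List[float]]  # toy stand-in for one latent channel (hypothetical layout)


def blend_regions(latents: List[Grid], masks: List[Grid]) -> Grid:
    """Mask-weighted average of per-prompt latents at each pixel.

    latents[i] is the latent denoised under prompt i; masks[i] holds that
    prompt's region weights in [0, 1]. Pixels covered by no mask are set
    to 0.0 (an arbitrary choice for this sketch).
    """
    height, width = len(latents[0]), len(latents[0][0])
    blended = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total_weight = sum(mask[y][x] for mask in masks)
            if total_weight > 0:
                weighted_sum = sum(
                    mask[y][x] * latent[y][x]
                    for mask, latent in zip(masks, latents)
                )
                blended[y][x] = weighted_sum / total_weight
    return blended


# Two prompts on a 1x3 strip whose regions overlap in the middle pixel.
sky_latent = [[1.0, 1.0, 1.0]]
tree_latent = [[3.0, 3.0, 3.0]]
sky_mask = [[1.0, 1.0, 0.0]]
tree_mask = [[0.0, 1.0, 1.0]]

result = blend_regions([sky_latent, tree_latent], [sky_mask, tree_mask])
print(result)  # [[1.0, 2.0, 3.0]]
```

In the overlapping pixel the two latents are averaged, while each exclusive region keeps its own prompt's value. In a real pipeline this blend would presumably be applied to full latent tensors after each denoising step.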