🤖 AI Summary
This paper addresses text-driven rearrangement of rigid geometric shapes: given a fixed set of shapes and a natural language description, the goal is to produce a non-overlapping, physically plausible vector composition that conveys the described concept. The proposed method, ShapeShift, couples diffusion-based semantic guidance with explicit geometric constraints (chiefly non-overlap) inside a differentiable vector graphics pipeline. Its main components are Score Distillation Sampling (SDS) for semantic alignment, differentiable vector rendering, and a content-aware collision resolution step that detects overlaps and corrects them with minimal adjustments. The reported experiments show quantitative and qualitative advantages over alternative techniques across diverse scenarios: the generated compositions are physically valid, have clear spatial relationships, and match the text prompt.
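For readers unfamiliar with Score Distillation Sampling, its gradient in the commonly cited form (introduced with DreamFusion) is written out below. Mapping it onto this setting is an assumption rather than a formula quoted from the paper: here \theta stands for the shape poses (positions and orientations), g for the differentiable vector renderer, y for the text prompt, \hat{\epsilon}_{\phi} for the pretrained diffusion model's noise prediction, and w(t) for a timestep weighting. The exact variant used by the authors is not stated in this summary.

```latex
\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta), \quad x_t = \alpha_t x + \sigma_t \epsilon .
```

Because only the poses are optimized, the factor \partial x / \partial \theta is where the differentiable renderer enters: it carries the diffusion model's signal back to each shape's translation and rotation.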
📝 Abstract
While diffusion-based models excel at generating photorealistic images from text, a more nuanced challenge emerges when generation is constrained to a fixed set of rigid shapes, akin to solving tangram puzzles or arranging real-world objects to match semantic descriptions. We formalize this problem as shape-based image generation, a new text-guided image-to-image translation task that requires rearranging the input set of rigid shapes into a non-overlapping configuration that visually communicates the target concept. Unlike pixel-manipulation approaches, our method, ShapeShift, explicitly parameterizes each shape within a differentiable vector graphics pipeline, iteratively optimizing placement and orientation through score distillation sampling from pretrained diffusion models. To preserve arrangement clarity, we introduce a content-aware collision resolution mechanism that applies minimal, semantically coherent adjustments when overlaps occur, ensuring smooth convergence toward physically valid configurations. By bridging diffusion-based semantic guidance with explicit geometric constraints, our approach yields interpretable compositions in which spatial relationships clearly embody the textual prompt. Extensive experiments demonstrate compelling results across diverse scenarios, with quantitative and qualitative advantages over alternative techniques.
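The two geometric ideas in the abstract, a per-shape pose parameterization and a minimal correction when shapes overlap, can be illustrated with a small sketch. This is not the paper's mechanism: the class `RigidShape`, the function `resolve_overlaps`, and the bounding-circle overlap test are all hypothetical stand-ins chosen for brevity, and the sketch ignores the content-aware (semantic) part entirely. It assumes each outline is given in local coordinates centered on the origin.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass
class RigidShape:
    """Hypothetical rigid shape: a fixed local outline plus a pose."""
    vertices: list      # local (x, y) outline, never modified
    tx: float = 0.0     # translation in x
    ty: float = 0.0     # translation in y
    theta: float = 0.0  # rotation in radians

    def world_vertices(self):
        # Rotate, then translate; the outline itself is never deformed.
        c, s = math.cos(self.theta), math.sin(self.theta)
        return [(c * x - s * y + self.tx, s * x + c * y + self.ty)
                for x, y in self.vertices]

    def radius(self):
        # Bounding-circle radius about the local origin (rotation-invariant).
        return max(math.hypot(x, y) for x, y in self.vertices)


def resolve_overlaps(shapes, max_passes=100, tol=1e-9):
    """Push overlapping bounding circles apart by the smallest translation.

    Assumption: circles are a conservative proxy for true polygon overlap.
    Returns True once a full pass finds no overlap, else False.
    """
    for _ in range(max_passes):
        moved = False
        for a, b in combinations(shapes, 2):
            dx, dy = b.tx - a.tx, b.ty - a.ty
            dist = math.hypot(dx, dy)
            gap = a.radius() + b.radius() - dist  # penetration depth
            if gap <= tol:
                continue
            # Coincident centers have no direction; pick the x axis.
            ux, uy = (1.0, 0.0) if dist == 0.0 else (dx / dist, dy / dist)
            # Split the correction evenly; rotations are left untouched.
            a.tx -= 0.5 * gap * ux
            a.ty -= 0.5 * gap * uy
            b.tx += 0.5 * gap * ux
            b.ty += 0.5 * gap * uy
            moved = True
        if not moved:
            return True
    return False


square = [(-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)]
shapes = [RigidShape(square), RigidShape(square, tx=0.5, theta=math.pi / 4)]
ok = resolve_overlaps(shapes)
```

In a full pipeline of the kind the abstract describes, a correction step like this would presumably be interleaved with the gradient updates on `tx`, `ty`, and `theta`. How the authors choose which shape to move and in which direction is exactly what the content-aware mechanism decides, and that is not modeled here.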