Canvas-to-Image: Compositional Image Generation with Multimodal Controls

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion models struggle to simultaneously achieve high fidelity and compositional consistency under multimodal conditional control (e.g., text, reference images, pose, and spatial layout). This paper introduces Canvas-to-Image, a multi-task diffusion generation framework built upon a unified canvas representation. It encodes heterogeneous control signals into a single structured canvas image and incorporates integrated visual-spatial reasoning alongside a Multi-Task Canvas Training strategy to enable end-to-end cross-modal joint modeling. Canvas-to-Image significantly improves identity preservation, pose accuracy, and layout controllability under complex conditions, and it outperforms state-of-the-art methods on challenging tasks including multi-person synthesis, fine-grained pose control, and semantic layout-constrained generation. To foster reproducibility and further research, the code and pretrained models are publicly released.

📝 Abstract
While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.
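The abstract's key idea is rasterizing heterogeneous control signals (subject references, pose constraints, layout annotations) into one composite canvas image that the diffusion model reads directly. A minimal sketch of such a canvas encoder is below; this is an illustrative assumption, not the paper's implementation, and the function name, placement format, and drawing conventions are all hypothetical.

```python
import numpy as np

def compose_canvas(height, width, references, poses, layout_boxes):
    """Rasterize heterogeneous controls onto a single RGB canvas (hypothetical sketch).

    references:   list of (image HxWx3 uint8, (top, left)) pastes for subject identity
    poses:        list of keypoint lists [(y, x), ...], drawn as red dots
    layout_boxes: list of (top, left, bottom, right), drawn as blue outlines
    """
    canvas = np.full((height, width, 3), 255, dtype=np.uint8)  # blank white canvas

    # Paste subject reference crops at their target positions.
    for img, (top, left) in references:
        h, w = img.shape[:2]
        canvas[top:top + h, left:left + w] = img

    # Draw pose keypoints as small red squares.
    for keypoints in poses:
        for y, x in keypoints:
            canvas[max(0, y - 2):y + 3, max(0, x - 2):x + 3] = [255, 0, 0]

    # Draw layout boxes as blue rectangle outlines.
    for top, left, bottom, right in layout_boxes:
        canvas[top, left:right] = [0, 0, 255]
        canvas[bottom - 1, left:right] = [0, 0, 255]
        canvas[top:bottom, left] = [0, 0, 255]
        canvas[top:bottom, right - 1] = [0, 0, 255]

    return canvas
```

The resulting canvas would then be fed to the diffusion model alongside the text prompt, letting one input channel carry all spatial controls instead of per-modality adapters.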
Problem

Research questions and friction points this paper is trying to address.

Achieving high-fidelity compositional generation when users specify multiple controls simultaneously
Unifying heterogeneous control signals (text, subject references, pose, layout) for integrated visual-spatial reasoning
Generalizing to multi-control scenarios without task-specific heuristics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified canvas interface for multimodal image generation controls
Multi-task training strategy integrating heterogeneous control signals
Joint learning paradigm for cross-modal reasoning and generalization