🤖 AI Summary
Existing step-by-step drawing tutorials lack interactivity and personalization, while mainstream generative models generalize poorly across media, exhibit temporal incoherence and structural distortion, and fail to faithfully replicate human artistic workflows. To address this, we propose a semantic-guided cross-media diffusion model, trained via reverse-drawing optimization on a large-scale dataset of real artistic process trajectories. Our method ensures texture-consistent evolution and procedural transfer across media. We introduce the Perceptual Distance Profile (PDP), the first metric to quantitatively characterize stage-wise progression (from composition and underpainting to detail refinement), enabling holistic process modeling. Integrating semantic segmentation guidance, cross-media style enhancement, and multi-metric evaluation (CLIP, DINO, LPIPS), our approach achieves significant improvements over baselines in cross-media consistency, temporal coherence, and final output quality. This work establishes a new paradigm for interactive, personalized art instruction.
📝 Abstract
Step-by-step painting tutorials are vital for learning artistic techniques, but existing video resources (e.g., YouTube) lack interactivity and personalization. While recent generative models have advanced artistic image synthesis, they struggle to generalize across media and often show temporal or structural inconsistencies, hindering faithful reproduction of human creative workflows. To address this, we propose a unified framework for multi-media painting process generation with a semantics-driven style control mechanism that embeds multiple media into a diffusion model's conditional space and uses cross-medium style augmentation. This enables consistent texture evolution and process transfer across styles. A reverse-painting training strategy further ensures smooth, human-aligned generation. We also build a large-scale dataset of real painting processes and evaluate cross-media consistency, temporal coherence, and final-image fidelity, achieving strong results on LPIPS, DINO, and CLIP metrics. Finally, our Perceptual Distance Profile (PDP) curve quantitatively models the creative sequence, i.e., composition, color blocking, and detail refinement, mirroring human artistic progression.
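The abstract does not give a formula for the PDP curve, so the following is only a minimal sketch of one plausible reading: measure the distance between each intermediate frame and the finished painting, then normalize by the distance of the first frame. The function name `perceptual_distance_profile`, the normalization, and the use of mean squared error as a stand-in for a learned perceptual distance such as LPIPS are all assumptions made for illustration, not the authors' definition.

```python
import numpy as np


def perceptual_distance_profile(frames, distance=None):
    """Distance of every frame to the final frame, scaled so frame 0 maps to 1.

    `frames` is a sequence of equally shaped float arrays (H, W, C).
    `distance` is any callable (a, b) -> float. The default is mean squared
    error, used here only as a placeholder for a perceptual metric.
    """
    if distance is None:
        distance = lambda a, b: float(np.mean((a - b) ** 2))

    final = frames[-1]
    raw = np.array([distance(f, final) for f in frames], dtype=float)

    # Normalize so curves from different paintings are comparable.
    if raw[0] > 0:
        raw = raw / raw[0]
    return raw


# Toy painting process: a white canvas blended linearly toward a target image.
rng = np.random.default_rng(0)
target = rng.random((32, 32, 3))
canvas = np.ones_like(target)
steps = np.linspace(0.0, 1.0, 6)
frames = [(1.0 - t) * canvas + t * target for t in steps]

profile = perceptual_distance_profile(frames)
```

In this toy setup the curve starts at 1 and falls to 0 as the canvas converges on the target. With real painting sequences and a perceptual distance, the shape of the curve (for example, a steep early drop followed by a long flat tail) is what would distinguish stages such as composition, color blocking, and detail refinement.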