🤖 AI Summary
Existing image-to-video methods struggle to generate high-fidelity motion graphics featuring active text animation and object deformation, while code-based vector animation approaches rely on manually annotated hierarchical vector structures, limiting applicability to single raster inputs. This paper introduces the first end-to-end framework that reconstructs semantic, layered HTML structure directly from a single raster image and synthesizes executable JavaScript animation code. Our method integrates raster image layer decomposition, HTML semantic reconstruction, cross-modal alignment—leveraging diffusion or generative models—and animation code synthesis, all without requiring manual vector layer annotations. Experiments demonstrate that our generated motion graphics significantly outperform general-purpose image-to-video models in text readability, structural fidelity, and motion plausibility. Crucially, outputs are deployable, editable frontend code—effectively bridging raster inputs and executable, vector-based motion graphics.
📝 Abstract
General image-to-video generation methods often produce suboptimal animations that do not meet the requirements of animated graphics, as they lack active text motion and exhibit object distortion. Also, code-based animation generation methods typically require layer-structured vector data which are often not readily available for motion graphic generation. To address these challenges, we propose a novel framework named MG-Gen that reconstructs data in vector format from a single raster image to extend the capabilities of code-based methods to enable motion graphics generation from a raster image in the framework of general image-to-video generation. MG-Gen first decomposes the input image into layer-wise elements, reconstructs them as HTML format data and then generates executable JavaScript code for the reconstructed HTML data. We experimentally confirm that ours{} generates motion graphics while preserving text readability and input consistency. These successful results indicate that combining layer decomposition and animation code generation is an effective strategy for motion graphics generation.