🤖 AI Summary
This work addresses the structural redundancy of Lottie JSON files, which hinders their direct use for high-quality vector animation generation. To overcome this limitation, the authors propose a parameterized Lottie tokenizer that decomposes animations into structured command-parameter sequences and introduce the first multimodal instruction-driven framework for vector animation synthesis. By integrating Lottie format parsing, sequence modeling, and a pretrained vision-language model, the method enables precise mapping from text-image instructions to coherent animations. The study also releases MMLottie-2M, a large-scale, professional-grade dataset comprising over two million animation samples. Experimental results demonstrate that the generated animations exhibit strong semantic consistency, visual expressiveness, and high fidelity to user instructions, confirming the effectiveness and controllability of the proposed approach.
📝 Abstract
OmniLottie is a versatile framework that generates high-quality vector animations from multi-modal instructions. For flexible control over motion and visual content, we focus on Lottie, a lightweight JSON format that represents both shapes and animation behaviors. However, raw Lottie JSON files contain extensive invariant structural metadata and formatting tokens, posing significant challenges for learning vector animation generation. We therefore introduce a carefully designed Lottie tokenizer that transforms JSON files into structured sequences of commands and parameters representing shapes, animation functions, and control parameters. This tokenizer enables us to build OmniLottie upon pretrained vision-language models to follow multi-modal interleaved instructions and generate high-quality vector animations. To further advance research in vector animation generation, we curate MMLottie-2M, a large-scale dataset of professionally designed vector animations paired with textual and visual annotations. Through extensive experiments, we validate that OmniLottie produces vivid, semantically aligned vector animations that adhere closely to multi-modal human instructions.
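To make the tokenizer idea concrete, here is a minimal, hypothetical sketch of flattening a Lottie-like JSON animation into a command-parameter token sequence while discarding invariant metadata (e.g. the version field). The command names (`SHAPE`, `KEYFRAME`, `STATIC`) and the exact traversal are illustrative assumptions, not the paper's actual tokenizer; the JSON keys (`ty`, `p`, `a`, `k`, `t`, `s`) follow the abbreviated Lottie schema.

```python
# Hypothetical sketch of a Lottie-style tokenizer: flatten animation
# content into command-parameter tokens, dropping structural metadata.
# Command vocabulary here is illustrative, not the paper's.

def tokenize(lottie: dict) -> list:
    tokens = []
    for layer in lottie.get("layers", []):
        for shape in layer.get("shapes", []):
            # Emit a shape command with its type code (e.g. "el" = ellipse).
            tokens += ["SHAPE", shape.get("ty", "?")]
            pos = shape.get("p", {})
            if pos.get("a") == 1 and isinstance(pos.get("k"), list):
                # Animated property: one KEYFRAME command per keyframe,
                # carrying the frame time "t" and start value "s".
                for kf in pos["k"]:
                    tokens += ["KEYFRAME", kf["t"]] + list(kf.get("s", []))
            else:
                # Static property: emit its value directly.
                tokens += ["STATIC"] + list(pos.get("k", []))
    return tokens

example = {
    "v": "5.7.0",  # version metadata: dropped by the tokenizer
    "layers": [{
        "shapes": [{
            "ty": "el",
            "p": {"a": 1, "k": [
                {"t": 0,  "s": [50, 50]},
                {"t": 30, "s": [150, 50]},
            ]},
        }],
    }],
}

print(tokenize(example))
# → ['SHAPE', 'el', 'KEYFRAME', 0, 50, 50, 'KEYFRAME', 30, 150, 50]
```

The point of such a flattening is that the model only ever sees content-bearing tokens (shape types, times, coordinates), not the repeated braces, key names, and version fields that dominate raw Lottie files.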