🤖 AI Summary
To address flickering artifacts in video sketch animation caused by discontinuous motion of stroke control points across frames, this paper proposes an end-to-end vector sketch animation generation framework. The method introduces differentiable motion trajectories (DMTs) to model temporal control-point evolution, parameterized smoothly and differentiably via Bernstein basis functions. It replaces implicit neural fields with explicit sparse trajectory points, enhancing computational efficiency and enabling long-video synthesis. An end-to-end deep network is designed to facilitate inter-frame semantic gradient propagation and ensure stable training. Extensive experiments on the DAVIS and LVOS benchmarks demonstrate significant improvements over state-of-the-art methods in both visual quality and temporal coherence, supporting high-frame-rate output. Further validation on 3D-model-driven animation and text-to-video cross-domain generation confirms the framework's robustness and generalization capability.
📝 Abstract
Sketching is a direct and inexpensive means of visual expression. Though image-based sketching has been well studied, video-based sketch animation generation remains challenging due to the temporal coherence requirement. In this paper, we propose a novel end-to-end automatic generation approach for vector sketch animation. To solve the flickering issue, we introduce a Differentiable Motion Trajectory (DMT) representation that describes the frame-wise movement of stroke control points using differentiable polynomial-based trajectories. DMT enables global semantic gradient propagation across multiple frames, significantly improving semantic consistency and temporal coherence, and producing high-frame-rate output. DMT employs a Bernstein basis to balance the sensitivity of polynomial parameters, thus achieving more stable optimization. Instead of implicit fields, we introduce sparse track points for explicit spatial modeling, which improves efficiency and supports long-duration video processing. Evaluations on the DAVIS and LVOS datasets demonstrate the superiority of our approach over SOTA methods. Cross-domain validation on 3D models and text-to-video data confirms the robustness and compatibility of our approach.
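The abstract does not include code, but the core idea of a Bernstein-basis motion trajectory can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the function names, the cubic degree, and the sample coefficients are all assumptions. Each stroke control point gets a trajectory defined by a few coefficients ("sparse track points"), and its position at any normalized time t in [0, 1] is a Bernstein-weighted combination of those coefficients, which is smooth in t and differentiable in the coefficients.

```python
import numpy as np
from math import comb

def bernstein_basis(n, t):
    """Bernstein basis values B_{i,n}(t) for i = 0..n at scalar t in [0, 1]."""
    return np.array([comb(n, i) * t**i * (1 - t)**(n - i) for i in range(n + 1)])

def trajectory_position(coeffs, t):
    """Evaluate a degree-n Bernstein trajectory at normalized time t.

    coeffs: (n+1, 2) array of 2D trajectory coefficients for one control point.
    Returns the control point's 2D position at time t.
    """
    n = coeffs.shape[0] - 1
    return bernstein_basis(n, t) @ coeffs

# Hypothetical cubic trajectory for one control point over a clip (t in [0, 1]).
coeffs = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 2.0], [4.0, 0.0]])
positions = np.stack([trajectory_position(coeffs, t)
                      for t in np.linspace(0.0, 1.0, 30)])  # 30 frames
```

Because every frame's position is a smooth function of the same small coefficient set, a rendering loss at any frame back-propagates to coefficients shared across all frames, which is the mechanism the abstract describes for cross-frame gradient propagation; in practice the evaluation would run under an autodiff framework rather than plain NumPy.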