🤖 AI Summary
This paper addresses the challenging problem of generating physically plausible 4D dynamic object sequences from only two 2D RGB images depicting the initial and final states. We propose a two-stage decoupled framework that requires neither 3D templates nor category-specific priors and accepts in-the-wild images as input. First, we leverage a pretrained generative image-to-3D reconstruction model to recover geometry and texture for both end states. Second, a differentiable, physics-driven deformation module evolves intermediate frames via latent-space interpolation, ensuring motion plausibility and strict geometric and textural consistency. The method achieves high-fidelity, unsupervised 4D sequence generation, significantly reducing reliance on large-scale annotated 4D datasets and heavy computational resources. To our knowledge, it is the first approach to enable end-to-end synthesis of spatiotemporally coherent 4D dynamic content from just two input RGB images.
📝 Abstract
Despite the astonishing progress in generative AI, 4D dynamic object generation remains an open challenge. With limited high-quality training data and heavy compute requirements, hallucinating unseen geometry together with unseen movement poses great difficulties for generative models. In this work, we propose TwoSquared, a method that obtains a physically plausible 4D sequence starting from only two 2D RGB images corresponding to the beginning and end of an action. Instead of directly solving the 4D generation problem, TwoSquared decomposes it into two steps: 1) an image-to-3D generation module built on an existing generative model trained on high-quality 3D assets, and 2) a physically inspired deformation module that predicts intermediate movements. As a result, our method requires neither templates nor object-class-specific prior knowledge and can take in-the-wild images as input. In our experiments, we demonstrate that TwoSquared produces texture-consistent and geometry-consistent 4D sequences given only 2D images.
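The two-stage decomposition can be sketched at a high level. The snippet below is a minimal illustration, not the paper's implementation: `reconstruct_3d` is a hypothetical stand-in for the pretrained image-to-3D model (here it just maps an image to a deterministic latent code), and the linear latent interpolation stands in for the physics-inspired deformation module that predicts intermediate movements.

```python
import zlib
import numpy as np

def reconstruct_3d(image: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for stage 1 (image-to-3D generation):
    # deterministically map the input RGB image to a latent shape code.
    seed = zlib.crc32(image.tobytes())
    return np.random.default_rng(seed).standard_normal(8)

def generate_sequence(z_start: np.ndarray, z_end: np.ndarray, n_frames: int = 5):
    # Sketch of stage 2: evolve intermediate states between the two end
    # latents. A real system would replace this linear interpolation with
    # a physics-driven deformation module enforcing motion plausibility.
    ts = np.linspace(0.0, 1.0, n_frames)
    return [(1.0 - t) * z_start + t * z_end for t in ts]

# Two in-the-wild RGB images: initial and final states of the action.
img_start = np.zeros((4, 4, 3), dtype=np.uint8)
img_end = np.full((4, 4, 3), 255, dtype=np.uint8)
frames = generate_sequence(reconstruct_3d(img_start), reconstruct_3d(img_end))
```

The key design point mirrored here is that no 4D supervision enters the pipeline: each stage only consumes the two input images and the latents derived from them.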