🤖 AI Summary
Vision-language models (VLMs) struggle to model the causal and reversible nature of physical actions, a critical gap for embodied AI. Method: We introduce the "Do-Undo" task and benchmark, the first systematic framework for evaluating whether models understand real-world reversible physical transformations. Our approach comprises: (1) curating a large-scale, video-derived dataset of reversible actions; (2) designing a bidirectional scene-transformation framework that combines action-conditioned generation with physics-consistency regularization; and (3) imposing cross-directional consistency constraints so that the forward ("do") and backward ("undo") directions are modeled jointly. Results: Extensive experiments reveal severe limitations of state-of-the-art VLMs on this task, while our method significantly improves joint accuracy on forward action generation and backward scene reconstruction. This work establishes a dedicated benchmark and modeling paradigm for physical reversibility in embodied AI, supporting evaluation and progress in robotic manipulation and physics-aware generative reasoning.
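The summary does not specify the exact form of the cross-directional consistency constraint, but one standard way to realize such a constraint is a cycle loss tying a forward ("do") model to a backward ("undo") model. The sketch below is a minimal, hypothetical PyTorch version: `do_model`, `undo_model`, the MSE objective, and the embedding shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cross_directional_consistency_loss(do_model, undo_model, scene, action):
    """Cycle loss coupling the "do" and "undo" directions.

    do_model   : maps (pre-action scene, action) -> post-action scene
    undo_model : maps (post-action scene, action) -> pre-action scene
    scene      : (B, D) batch of scene embeddings
    action     : (B, D) batch of action embeddings
    All names, shapes, and the MSE choice are assumptions for illustration.
    """
    # Do then undo: applying an action and then reversing it
    # should reconstruct the original scene.
    post = do_model(scene, action)
    reconstructed = undo_model(post, action)
    do_undo = F.mse_loss(reconstructed, scene)

    # Undo then do: the opposite cycle should close as well,
    # keeping the two directions mutually consistent.
    pre = undo_model(scene, action)
    redone = do_model(pre, action)
    undo_do = F.mse_loss(redone, scene)

    return do_undo + undo_do

# Toy usage with bilinear stand-ins for the two direction models.
B, D = 4, 16
do_model = torch.nn.Bilinear(D, D, D)
undo_model = torch.nn.Bilinear(D, D, D)
scene, action = torch.randn(B, D), torch.randn(B, D)
loss = cross_directional_consistency_loss(do_model, undo_model, scene, action)
loss.backward()
```

In practice such a term would be added to the primary action-conditioned generation objective, so that each direction is trained both on its own targets and on closing the round trip.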
📝 Abstract
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires a model to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause and effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy that enforces cross-directional consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.