🤖 AI Summary
Vision-language models (VLMs) struggle to model the causal and reversible nature of physical actions, a critical gap for embodied AI. Method: We introduce the "Do-Undo" task and benchmark, the first systematic framework for evaluating whether models understand real-world reversible physical transformations. Our approach comprises: (1) curating a large-scale, video-derived dataset of reversible actions; (2) designing a bidirectional scene-transformation framework that combines action-conditioned generation with physics-consistency regularization; and (3) imposing cross-directional consistency constraints so that the forward ("do") and backward ("undo") directions are modeled jointly. Results: Extensive experiments reveal severe limitations of state-of-the-art VLMs on this task, while our method significantly improves joint accuracy on forward action generation and backward scene reconstruction. This work establishes a dedicated benchmark and modeling paradigm for physical reversibility in embodied AI, supporting evaluation and progress in robotic manipulation and physics-aware generative reasoning.
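The summary does not specify the exact form of the cross-directional consistency constraint, but one standard way to realize such a constraint is a cycle loss tying a forward ("do") model to a backward ("undo") model. The sketch below is a minimal, hypothetical PyTorch version: `do_model`, `undo_model`, the MSE objective, and the embedding shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cross_directional_consistency_loss(do_model, undo_model, scene, action):
    """Cycle loss coupling the "do" and "undo" directions.

    do_model   : maps (pre-action scene, action) -> post-action scene
    undo_model : maps (post-action scene, action) -> pre-action scene
    scene      : (B, D) batch of scene embeddings
    action     : (B, D) batch of action embeddings
    All names, shapes, and the MSE choice are assumptions for illustration.
    """
    # Do then undo: applying an action and then reversing it
    # should reconstruct the original scene.
    post = do_model(scene, action)
    reconstructed = undo_model(post, action)
    do_undo = F.mse_loss(reconstructed, scene)

    # Undo then do: the opposite cycle should close as well,
    # keeping the two directions mutually consistent.
    pre = undo_model(scene, action)
    redone = do_model(pre, action)
    undo_do = F.mse_loss(redone, scene)

    return do_undo + undo_do

# Toy usage with bilinear stand-ins for the two direction models.
B, D = 4, 16
do_model = torch.nn.Bilinear(D, D, D)
undo_model = torch.nn.Bilinear(D, D, D)
scene, action = torch.randn(B, D), torch.randn(B, D)
loss = cross_directional_consistency_loss(do_model, undo_model, scene, action)
loss.backward()
```

In practice such a term would be added to the primary action-conditioned generation objective, so that each direction is trained both on its own targets and on closing the round trip.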
📝 Abstract
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires a model to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause and effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy that enforces cross-directional consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.