DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Most robotic vision encoders are pretrained for static recognition or image–text alignment, so they struggle to model the scene changes that actions induce. DynaFLIP addresses this with dynamics-aware multimodal pretraining on image–language–3D flow triplets, moving motion understanding into the perception stage. The method minimizes the simplex volume spanned by the three modality embeddings in a shared hyperspherical space, and pairs this objective with a cosine regularizer and a contrastive loss to avoid geometric ambiguity and trivial collapse. The triplets are used only as training-time supervision; the resulting encoder takes images alone. Across diverse simulated and real-world robotic tasks, DynaFLIP consistently outperforms baselines, with gains of up to +22.5% under out-of-distribution conditions.
📝 Abstract
Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space -- a smaller simplex volume indicating stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this across diverse simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
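To make the geometry concrete, here is a minimal NumPy sketch of how a simplex-volume term, a cosine regularizer, and a contrastive term could be combined. It is an illustration under stated assumptions, not the authors' implementation: the volume is taken as the Gram-determinant volume of the simplex formed by the origin and the three unit embeddings, the regularizer is the mean of (1 - cosine) over the three modality pairs, the contrastive term is a one-directional InfoNCE loss, and every function name, weight, and the temperature value is hypothetical. The sketch also shows why volume alone is ambiguous: any coplanar triple has zero volume, including one where two embeddings point in opposite directions.

```python
import numpy as np

def normalize(x, eps=1e-8):
    """Project embeddings onto the unit hypersphere."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def simplex_volume(img, txt, flow):
    """Per-sample volume of the simplex spanned by the origin and three vectors.

    Assumption: volume = sqrt(det(Gram)) / 3!, where Gram holds pairwise dot products.
    Inputs have shape (B, D) and are expected to be unit-norm.
    """
    v = np.stack([img, txt, flow], axis=1)       # (B, 3, D)
    gram = v @ v.transpose(0, 2, 1)              # (B, 3, 3)
    det = np.clip(np.linalg.det(gram), 0.0, None)  # guard against tiny negatives
    return np.sqrt(det) / 6.0

def cosine_regularizer(img, txt, flow):
    """Per-sample mean of (1 - cosine similarity) over the three modality pairs."""
    cos = lambda a, b: np.sum(a * b, axis=-1)
    return 1.0 - (cos(img, txt) + cos(img, flow) + cos(txt, flow)) / 3.0

def info_nce(anchor, other, temperature=0.07):
    """One-directional InfoNCE: row i of `anchor` should match row i of `other`."""
    logits = anchor @ other.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def combined_loss(img, txt, flow, w_vol=1.0, w_cos=1.0, w_con=1.0):
    """Hypothetical weighted sum of the three terms (weights are illustrative)."""
    img, txt, flow = (normalize(m) for m in (img, txt, flow))
    vol = simplex_volume(img, txt, flow).mean()
    reg = cosine_regularizer(img, txt, flow).mean()
    con = 0.5 * (info_nce(img, txt) + info_nce(img, flow))
    return w_vol * vol + w_cos * reg + w_con * con

# Degenerate case: img and txt are antipodal, yet the triple is coplanar.
e1 = np.array([[1.0, 0.0, 0.0]])
e2 = np.array([[0.0, 1.0, 0.0]])
print(simplex_volume(e1, -e1, e2))      # volume vanishes despite misalignment
print(cosine_regularizer(e1, -e1, e2))  # regularizer still penalizes the triple
```

In this sketch the cosine term is what separates a truly aligned triple (zero volume, zero penalty) from a merely coplanar one (zero volume, positive penalty), while the contrastive term discourages all samples from collapsing to a single point.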
Problem

Research questions and friction points this paper is trying to address.

robotics perception
motion understanding
multimodal representation
dynamics-aware learning
manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamics-aware perception
multimodal pre-training
simplex-volume minimization
robot manipulation
3D flow