🤖 AI Summary
Existing vision-language-action (VLA) models struggle to combine temporal prediction with object perception in complex scenes: they typically act only on the current frame, and future prediction and object-aware reasoning are often learned in separate latent spaces, which limits robustness. This work proposes OFlow, a framework that unifies temporal foresight and object-aware reasoning within a shared semantic latent space. It forecasts future latents with temporal flow matching, factorizes them into object-aware representations that emphasize physically relevant cues while filtering task-irrelevant variation, and conditions continuous action generation on these predictions. Integrated into VLA pipelines, this joint modeling enables more reliable control under distribution shifts, and experiments on the LIBERO, LIBERO-Plus, MetaWorld, and SimplerEnv benchmarks and on real-world tasks show consistent gains in robustness and success rate.
📝 Abstract
Robust robotic manipulation requires not only predicting how the scene evolves over time, but also recognizing task-relevant objects in complex scenes. However, existing VLA models face two limitations: they typically act only on the current frame, and future prediction and object-aware reasoning are often learned in separate latent spaces. We propose OFlow (injecting Object-Aware Temporal Flow Matching into VLAs), a framework that addresses both limitations by unifying temporal foresight and object-aware reasoning in a shared semantic latent space. Our method forecasts future latents with temporal flow matching, factorizes them into object-aware representations that emphasize physically relevant cues while filtering task-irrelevant variation, and conditions continuous action generation on these predictions. Integrated into VLA pipelines, OFlow enables more reliable control under distribution shifts. Extensive experiments on the LIBERO, LIBERO-Plus, MetaWorld, and SimplerEnv benchmarks and on real-world tasks demonstrate that object-aware foresight consistently enhances robustness and success.
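The abstract does not give the formulation behind "temporal flow matching", so the following is only a minimal sketch of generic flow matching with a straight-line path, not the authors' implementation. It assumes a start latent `z0` (for example noise) is transported to a target latent `z1` (for example a future-frame latent) along `z_t = (1 - t) * z0 + t * z1`, with a velocity field trained to regress `z1 - z0`. All function names are hypothetical, and the `oracle` field in the usage lines stands in for a learned network.

```python
def interpolate(z0, z1, t):
    """Point on the straight path from start latent z0 (t=0) to target latent z1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]


def target_velocity(z0, z1):
    """Regression target for the velocity field along the straight path."""
    return [b - a for a, b in zip(z0, z1)]


def flow_matching_loss(velocity_fn, z0, z1, t):
    """Mean squared error between predicted and target velocity at one time t."""
    z_t = interpolate(z0, z1, t)
    pred = velocity_fn(z_t, t)
    target = target_velocity(z0, z1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, z0, steps=10):
    """Integrate dz/dt = velocity_fn(z, t) from t=0 to t=1 with fixed-step Euler."""
    z = list(z0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(z, i * dt)
        z = [a + dt * b for a, b in zip(z, v)]
    return z


# Toy usage: an oracle velocity field that already knows the target latent.
# In practice this would be a learned model conditioned on the observation.
z0 = [0.5, -1.0, 2.0]
z1 = [1.0, 0.0, -1.0]
oracle = lambda z, t: [(b - a) / (1.0 - t) for a, b in zip(z, z1)]

loss = flow_matching_loss(oracle, z0, z1, t=0.3)
predicted = euler_sample(oracle, z0, steps=10)
```

With the oracle field the loss is zero up to floating-point error and the Euler rollout ends at `z1`. A trained model would only approximate this, and how OFlow factorizes the predicted latents into object-aware representations is not covered by this sketch.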