🤖 AI Summary
Current vision-centric foundation models struggle to accurately align the physical logic of ego-motion with visual perception in autonomous driving. This work proposes EgoDyn-Bench, a diagnostic benchmark that maps continuous vehicle kinematics into discrete motion concepts to decouple and evaluate a model’s physical reasoning and visual perception capabilities. Large-scale evaluation of over twenty state-of-the-art VLMs, MLLMs, and VLAs on this benchmark reveals, for the first time, a “perception bottleneck”: existing models rely predominantly on linguistic priors for physical reasoning, with minimal contribution from visual inputs. Introducing explicit trajectory encoding substantially improves physical consistency across all models, surpassing classical geometric baselines and demonstrating an effective pathway toward vision-motion alignment.
📝 Abstract
While Vision-Language Models (VLMs) have advanced high-level reasoning in autonomous driving, their ability to ground this reasoning in the underlying physics of ego-motion remains poorly understood. We introduce EgoDyn-Bench, a diagnostic benchmark for evaluating the semantic ego-motion understanding of vision-centric foundation models. By mapping continuous vehicle kinematics to discrete motion concepts via a deterministic oracle, we decouple a model's internal physical logic from its visual perception. Our large-scale empirical audit spanning 20+ models, including closed-source MLLMs, open-source VLMs across multiple scales, and specialized VLAs, identifies a significant Perception Bottleneck: while models exhibit logically coherent physical concepts, they consistently fail to align them with visual observations, frequently underperforming classical non-learned geometric baselines. This failure persists across model scales and domain-specific training, indicating a structural deficit in how current architectures couple visual perception with physical reasoning. We demonstrate that providing explicit trajectory encodings substantially restores physical consistency across all evaluated models, revealing a functional disentanglement between vision and language: ego-motion logic is derived almost exclusively from the language modality, while visual observations contribute negligible additional signal. This structural finding provides a standardized diagnostic framework and a practical pathway toward physically aligned embodied AI.
Keywords: Ego-motion · Physical Reasoning · Foundation Models
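To make the idea of a deterministic oracle concrete, the following is a minimal sketch of how continuous kinematics could be mapped to discrete motion concepts. It is not the benchmark's actual implementation: the signal names (`speed_mps`, `accel_mps2`, `yaw_rate_rps`), the threshold values, the sign convention for yaw rate, and the label vocabulary are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KinematicSample:
    """One ego-vehicle measurement (field names are hypothetical)."""
    speed_mps: float      # longitudinal speed in m/s
    accel_mps2: float     # longitudinal acceleration in m/s^2
    yaw_rate_rps: float   # yaw rate in rad/s; positive = left turn (assumed)


# Illustrative thresholds only; the abstract does not specify any values.
STOP_SPEED = 0.2      # below this speed the vehicle counts as stopped
ACCEL_THRESH = 0.5    # |acceleration| above this counts as speeding up / slowing down
YAW_THRESH = 0.05     # |yaw rate| above this counts as turning


def longitudinal_label(s: KinematicSample) -> str:
    """Discretize longitudinal motion into one of four concepts."""
    if abs(s.speed_mps) < STOP_SPEED:
        return "stopped"
    if s.accel_mps2 > ACCEL_THRESH:
        return "accelerating"
    if s.accel_mps2 < -ACCEL_THRESH:
        return "decelerating"
    return "cruising"


def lateral_label(s: KinematicSample) -> str:
    """Discretize lateral motion; a stopped vehicle has no lateral concept."""
    if abs(s.speed_mps) < STOP_SPEED:
        return "none"
    if s.yaw_rate_rps > YAW_THRESH:
        return "turning_left"
    if s.yaw_rate_rps < -YAW_THRESH:
        return "turning_right"
    return "straight"


def oracle(s: KinematicSample) -> tuple[str, str]:
    """Deterministic mapping: the same kinematics always yield the same labels."""
    return longitudinal_label(s), lateral_label(s)
```

Because the mapping is a pure function of the kinematic signals, the same labels can serve as ground truth in two settings: once when the model sees only video, and once when it is also given the trajectory explicitly. Comparing the two is one way to separate perception errors from reasoning errors, in the spirit of the decoupling described above.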