🤖 AI Summary
This work addresses the significant performance degradation of Visual Foresight vision-language-action (VF-VLA) models in out-of-distribution (OOD) scenarios, which is caused primarily by inaccurate future image predictions. To mitigate this issue, the paper introduces test-time training into the VF-VLA framework for the first time, proposing the T³VF method. During inference, T³VF refines the model online using a self-supervised signal derived from the discrepancy between each predicted image and the actual observation that follows it. An adaptive update filtering mechanism prevents unstable parameter updates. T³VF mitigates the OOD performance degradation without altering the model architecture or adding auxiliary modules, at a modest additional inference cost.
📝 Abstract
Visual Foresight VLA (VF-VLA) has become a prominent architectural choice in recent vision-language-action (VLA) models due to its impressive performance. Nevertheless, the inherent design of VF-VLA makes it particularly vulnerable to out-of-distribution (OOD) shifts. Because action quality depends directly on the accuracy of the predicted future visual information, OOD conditions degrade both stages, visual prediction and action generation, at once. To address this vulnerability, we propose Test-Time Training Visual Foresight VLA ($T^3$VF), a test-time training approach motivated by the observation that a predicted future image and the observation that subsequently arrives form a natural supervision pair. To further address the practical challenges that arise from indiscriminate test-time updates, we introduce an adaptive update filtering mechanism. Empirically, $T^3$VF mitigates the OOD vulnerability of VF-VLA at a modest additional inference cost, without requiring any architectural modification or auxiliary modules.