AI Summary
This work addresses a limitation of existing vision-language-action (VLA) models: they fuse all multimodal signals at a single fixed frequency, which downsamples high-frequency force feedback and impairs responsiveness to dynamic contact changes. To overcome this, the authors propose a fast-slow decoupled VLA architecture. A low-frequency vision-language module handles perception, planning, and future force prediction, while a high-frequency action expert generates reactive motions from real-time force sequences, with a dedicated force adapter injecting force features into multiple layers of the action expert. The execution frequency of the action expert is scheduled adaptively from the predicted force variation, which decouples the timescales of perception and control. Evaluated on contact-rich tasks, the method significantly outperforms baselines in both reactivity and success rate, achieving robust manipulation with reduced contact forces.
Abstract
Force/torque feedback can substantially improve Vision-Language-Action (VLA) models on contact-rich manipulation, but most existing approaches fuse all modalities at a single operating frequency. This design ignores the mismatched sampling rates of real robot sensors and forces downsampling of the high-frequency contact cues needed for reactive correction. Combined with common VLM-action-expert (AE) pipelines that execute action chunks largely open loop between expensive VLM updates, unified-frequency fusion often yields delayed responses to impacts, stick-slip, and force spikes. We propose FAVLA, a force-adaptive fast-slow VLA that decouples slow perception and planning from fast contact-aware control. FAVLA runs a slow VLM at a fixed low frequency to encode the input modalities into latent representations and to predict near-future force variation. A fast AE then executes at a variable high frequency, conditioning on the latest force sequence to generate reactive actions. We further introduce a force adapter that injects high-frequency force features into multiple AE layers, and adaptively schedules the AE's execution frequency based on the VLM's predicted force variation. Extensive experiments on contact-rich tasks demonstrate that FAVLA significantly outperforms baselines in reactivity and success rate while applying smaller contact forces during manipulation.
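The abstract does not state the rule used to turn the VLM's predicted force variation into an AE execution frequency. The sketch below illustrates one plausible form of such a scheduler, assuming a clamped linear mapping. Every name (`SchedulerConfig`, `ae_frequency`, `ae_steps_per_vlm_cycle`, `schedule`) and every numeric value is a hypothetical choice for illustration and is not taken from FAVLA.

```python
from dataclasses import dataclass


@dataclass
class SchedulerConfig:
    """Hypothetical scheduler settings; none of these values come from the paper."""
    vlm_hz: float = 2.0        # assumed fixed rate of the slow VLM
    ae_min_hz: float = 10.0    # assumed AE rate when little force change is expected
    ae_max_hz: float = 50.0    # assumed AE rate when large force change is expected
    force_scale: float = 5.0   # assumed predicted variation (N) that saturates the rate


def ae_frequency(predicted_variation: float, cfg: SchedulerConfig) -> float:
    """Map predicted force variation to an AE rate by clamped linear interpolation."""
    ratio = min(abs(predicted_variation) / cfg.force_scale, 1.0)
    return cfg.ae_min_hz + ratio * (cfg.ae_max_hz - cfg.ae_min_hz)


def ae_steps_per_vlm_cycle(predicted_variation: float, cfg: SchedulerConfig) -> int:
    """Number of fast AE steps to run before the next slow VLM update."""
    return max(1, round(ae_frequency(predicted_variation, cfg) / cfg.vlm_hz))


def schedule(predicted_variations, cfg: SchedulerConfig):
    """Return the AE step count for each slow cycle, given per-cycle predictions."""
    return [ae_steps_per_vlm_cycle(v, cfg) for v in predicted_variations]


if __name__ == "__main__":
    cfg = SchedulerConfig()
    # Three slow cycles: free motion, light contact, strong contact (made-up values).
    print(schedule([0.0, 2.5, 8.0], cfg))
```

In this sketch the slow module's rate stays fixed and only the number of fast steps per slow cycle changes, which mirrors the fixed-rate VLM and variable-rate AE described in the abstract. A learned or thresholded mapping would fit that description equally well.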