FAVLA: A Force-Adaptive Fast-Slow VLA model for Contact-Rich Robotic Manipulation

πŸ“… 2026-02-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses a limitation of existing vision-language-action (VLA) models, which fuse multimodal signals uniformly at a fixed frequency, thereby downsampling high-frequency force feedback and impairing responsiveness to dynamic contact changes. To overcome this, the authors propose a fast-slow decoupled VLA architecture: a low-frequency vision-language module handles perception, planning, and future force prediction, while a high-frequency action expert generates reactive motions from real-time force sequences, with multi-layer force features injected via a dedicated force adapter. This design enables, for the first time, force-driven adaptive scheduling of the execution frequency, effectively decoupling the timescales of perception and control. Evaluated on contact-intensive tasks, the method significantly outperforms baselines in both reactivity and success rate, achieving robust manipulation with reduced contact forces.
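The fast-slow loop described above can be sketched in a few lines: a slow VLM tick produces a latent plan plus a predicted near-future force variation, and that prediction sets how many fast action-expert (AE) steps run before the next slow tick. This is a minimal illustrative sketch; the function names, the 10 Hz VLM rate, the frequency range, and the linear variation-to-frequency mapping are all my assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of FAVLA-style fast-slow control with force-driven
# frequency scheduling. All names and constants are assumptions.

def schedule_ae_frequency(predicted_force_variation: float,
                          f_min: float = 10.0,
                          f_max: float = 100.0,
                          variation_scale: float = 5.0) -> float:
    """Map the VLM's predicted near-future force variation to an AE
    execution frequency: larger predicted contact changes -> faster control."""
    # Saturating linear ramp of the variation into [0, 1].
    alpha = min(max(predicted_force_variation / variation_scale, 0.0), 1.0)
    return f_min + alpha * (f_max - f_min)


def fast_slow_step(vlm, action_expert, obs, force_buffer, vlm_hz: float = 10.0):
    """One slow-cycle iteration: the VLM plans once, then the AE runs
    several fast steps conditioned on the latest force window."""
    latent, predicted_variation = vlm(obs)            # slow, fixed low frequency
    ae_hz = schedule_ae_frequency(predicted_variation)
    n_fast_steps = int(ae_hz / vlm_hz)                # fast steps per slow tick
    actions = []
    for _ in range(n_fast_steps):
        force_window = force_buffer[-16:]             # latest high-rate force readings
        actions.append(action_expert(latent, force_window))
    return actions
```

With this mapping, a predicted variation of 0 yields the minimum 10 Hz, while anything at or above `variation_scale` saturates at 100 Hz, so quiescent free-space motion stays cheap and anticipated contact transients get the densest control.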

πŸ“ Abstract
Force/torque feedback can substantially improve Vision-Language-Action (VLA) models on contact-rich manipulation, but most existing approaches fuse all modalities at a single operating frequency. This design ignores the mismatched sampling rates of real robot sensors, forcing downsampling of the high-frequency contact cues needed for reactive correction. Combined with common VLM-action-expert (AE) pipelines that execute action chunks largely open loop between expensive VLM updates, unified-frequency fusion often yields delayed responses to impacts, stick-slip, and force spikes. We propose FAVLA, a force-adaptive fast-slow VLA that decouples slow perception and planning from fast contact-aware control. FAVLA runs a slow VLM at a fixed low frequency to encode all modalities into latent representations and to predict near-future force variation. A fast AE then executes at a variable high frequency, conditioned on the latest force sequence, to generate reactive actions. We further introduce a force adapter that injects high-frequency force features into multiple AE layers, and we adaptively schedule the AE's execution frequency based on the VLM's predicted force variation. Extensive experiments on contact-rich tasks demonstrate that FAVLA significantly outperforms baselines, achieving superior reactivity and success rates while applying smaller contact forces during manipulation.
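The force adapter in the abstract injects high-frequency force features into multiple AE layers. One common way to realize such multi-layer conditioning is FiLM-style scale-and-shift modulation; the sketch below shows that pattern under stated assumptions. The class name, the temporal-mean pooling, the per-layer linear projections, and the residual `1 + gamma` gating are all illustrative choices, not FAVLA's confirmed design.

```python
# Illustrative force-adapter sketch (assumed FiLM-style conditioning of
# each AE layer); shapes and the gating form are assumptions.
import numpy as np

rng = np.random.default_rng(0)

class ForceAdapter:
    """Projects a high-rate force window into per-layer (scale, shift)
    pairs that modulate each action-expert layer's hidden features."""

    def __init__(self, force_dim: int, hidden_dim: int, n_layers: int):
        self.n_layers = n_layers
        # One small projection per AE layer; gamma and beta concatenated.
        self.proj = [rng.standard_normal((force_dim, 2 * hidden_dim)) * 0.01
                     for _ in range(n_layers)]

    def __call__(self, force_window: np.ndarray):
        # Summarize the (T, force_dim) window by its temporal mean.
        f = force_window.mean(axis=0)
        mods = []
        for W in self.proj:
            gamma, beta = np.split(f @ W, 2)
            mods.append((1.0 + gamma, beta))  # residual-style scale around 1
        return mods


def apply_adapter(hidden: np.ndarray, gamma: np.ndarray, beta: np.ndarray):
    """FiLM modulation of a single AE layer's features."""
    return gamma * hidden + beta
```

A useful property of the residual gating is that a zero-force window reduces every layer's modulation to the identity, so the adapter only perturbs the AE when contact signals are actually present.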
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
contact-rich manipulation
force feedback
sensor fusion
reactive control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Force-Adaptive
Fast-Slow Architecture
Contact-Rich Manipulation
Variable Execution Frequency
Force Adapter