FAVLA: A Force-Adaptive Fast-Slow VLA model for Contact-Rich Robotic Manipulation

📅 2026-02-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing vision-language-action (VLA) models: they fuse all modalities at a single fixed frequency, which downsamples high-frequency force feedback and slows the response to changing contact conditions. The authors propose a fast-slow decoupled VLA architecture. A low-frequency vision-language module handles perception, planning, and prediction of near-future force variation, while a high-frequency action expert generates reactive actions from the latest force sequence. A dedicated force adapter injects high-frequency force features into multiple layers of the action expert, and the predicted force variation is used to schedule the action expert's execution frequency adaptively, decoupling the timescales of perception and control. On contact-rich tasks, the method is reported to outperform baselines in reactivity and success rate while applying smaller contact forces.
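The downsampling problem described above can be illustrated with a toy signal. The sketch below is not from the paper: the sensor rate, fusion rate, spike size, and helper names (`make_force_trace`, `downsample`) are all assumptions chosen only to show how a short contact spike can vanish when force readings are resampled to a slower shared rate.

```python
# Toy illustration only; every number and name here is an assumption,
# not a value or API from the FAVLA paper.
FORCE_HZ = 500   # assumed force/torque sensor rate
FUSION_HZ = 10   # assumed single rate at which all modalities are fused


def make_force_trace(n_samples, spike_index, spike_newtons=20.0, baseline=1.0):
    """Return a constant baseline force with one single-sample contact spike."""
    trace = [baseline] * n_samples
    trace[spike_index] = spike_newtons
    return trace


def downsample(trace, src_hz, dst_hz):
    """Keep every (src_hz // dst_hz)-th sample, as naive fixed-rate fusion would."""
    stride = src_hz // dst_hz
    return trace[::stride]


# One second of force data with a spike at sample 123.
force = make_force_trace(n_samples=FORCE_HZ, spike_index=123)
fused = downsample(force, FORCE_HZ, FUSION_HZ)

print("peak at sensor rate:", max(force))
print("peak after downsampling:", max(fused))
```

With these assumed rates the stride is 50 samples, so the spike at sample 123 falls between retained samples and the fused stream never sees it. A real system would typically filter before decimating, which preserves some energy from the spike but still smears its timing.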

📝 Abstract

Force/torque feedback can substantially improve Vision-Language-Action (VLA) models on contact-rich manipulation, but most existing approaches fuse all modalities at a single operating frequency. This design ignores the mismatched sampling rates of real robot sensors and forces downsampling of the high-frequency contact cues needed for reactive correction. Combined with common VLM-action-expert (AE) pipelines that execute action chunks largely open-loop between expensive VLM updates, unified-frequency fusion often yields delayed responses to impacts, stick-slip, and force spikes. We propose FAVLA, a force-adaptive fast-slow VLA that decouples slow perception and planning from fast contact-aware control. FAVLA runs a slow VLM at a fixed low frequency to encode all modalities into latent representations and to predict near-future force variation. A fast AE then executes at a variable high frequency, conditioning on the latest force sequence to generate reactive actions. We further introduce a force adapter that injects high-frequency force features into multiple AE layers, and we adaptively schedule the AE's execution frequency based on the VLM's predicted force variation. Extensive experiments on contact-rich tasks demonstrate that FAVLA significantly outperforms baselines, achieving higher reactivity and success rates while applying smaller contact forces during manipulation.
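The abstract does not specify how predicted force variation maps to the AE's execution frequency. As a rough sketch of the idea, the code below assumes a clamped linear mapping; the function names, frequency bounds, saturation level, and slow-loop rate are hypothetical and are not taken from the paper.

```python
# Minimal sketch of force-adaptive frequency scheduling.
# ASSUMPTIONS (not from the paper): a clamped linear mapping, the 20-200 Hz
# range, the 10 N saturation level, and the 5 Hz slow-loop rate.

def schedule_ae_frequency(predicted_force_variation,
                          min_hz=20.0, max_hz=200.0, saturation=10.0):
    """Map a predicted force variation (in newtons) to an AE frequency in Hz.

    Small predicted variation keeps the action expert near min_hz; variation
    at or above `saturation` pushes it to max_hz.
    """
    ratio = min(max(predicted_force_variation / saturation, 0.0), 1.0)
    return min_hz + ratio * (max_hz - min_hz)


def run_slow_cycle(predicted_force_variation, slow_hz=5.0):
    """Return (AE frequency, number of AE steps) for one slow VLM period."""
    ae_hz = schedule_ae_frequency(predicted_force_variation)
    n_fast_steps = max(1, round(ae_hz / slow_hz))
    return ae_hz, n_fast_steps


# Free-space motion (little predicted variation) versus imminent contact.
for variation in (0.0, 5.0, 10.0):
    ae_hz, steps = run_slow_cycle(variation)
    print(f"variation={variation} N -> AE at {ae_hz} Hz, {steps} steps per VLM update")
```

Under these assumptions the fast loop spends little compute during free-space motion and takes many more force-conditioned steps per VLM update when a contact transition is predicted. Any monotone mapping would serve the same illustrative purpose.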

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

contact-rich manipulation

force feedback

sensor fusion

reactive control

Innovation

Methods, ideas, or system contributions that make the work stand out.

Force-Adaptive

Fast-Slow Architecture

Contact-Rich Manipulation

Variable Execution Frequency

Force Adapter

🔎 Similar Papers

HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers

2024-09-12, arXiv.org, Citations: 0

VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation

2024-09-19, Citations: 5
