🤖 AI Summary
This work addresses high-precision force control in contact-rich robotic manipulation, where complex contact dynamics limit the performance of imitation learning policies. It proposes ForceFlow, a force-aware reactive framework built on flow matching. At the task level, ForceFlow splits manipulation into a vision-dominant approach stage and a touch-dominant interaction stage, linked by a Vision-to-Force (V2F) handover mechanism that explicitly decouples spatial generalization from contact regulation. At the policy level, an asymmetric multimodal fusion architecture treats force as a global regulatory signal, and a joint prediction paradigm strengthens the policy's use of instantaneous force and historical information. Across six real-world contact-rich tasks, ForceFlow improves success rate by 37% over the ForceVLA baseline while showing accurate force prediction, robust contact force self-regulation, and strong zero-shot out-of-distribution (OOD) generalization.
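The two-stage structure with a V2F handover can be pictured as a small state machine. The sketch below is illustrative only: the summary does not say what triggers the handover, so the force-magnitude threshold, the `HandoverController` class, and the `contact_threshold_n` parameter are all assumptions, not the authors' implementation.

```python
import math
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    APPROACH = "approach"   # vision-dominant stage
    INTERACT = "interact"   # touch-dominant stage


@dataclass
class HandoverController:
    # Hypothetical trigger: switch stages once the measured contact force
    # magnitude reaches this threshold (in newtons). Assumed value.
    contact_threshold_n: float = 2.0
    phase: Phase = Phase.APPROACH

    def update(self, force_xyz):
        """Advance the phase given one force reading (fx, fy, fz)."""
        magnitude = math.sqrt(sum(f * f for f in force_xyz))
        # One-way latch: once contact is detected, stay in the interaction
        # stage even if the force later drops back toward zero.
        if self.phase is Phase.APPROACH and magnitude >= self.contact_threshold_n:
            self.phase = Phase.INTERACT
        return self.phase


ctrl = HandoverController()
before_contact = ctrl.update((0.0, 0.0, 0.5))
at_contact = ctrl.update((0.0, 0.0, 3.0))
after_release = ctrl.update((0.0, 0.0, 0.0))
```

In this sketch the approach-stage policy would only need to bring the end effector to the target, and everything after the switch would be left to the force-driven policy. That mirrors the described split between spatial generalization and contact regulation.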
📝 Abstract
Existing imitation learning methods enable robots to interact autonomously with the physical environment. However, contact-rich manipulation remains a significant challenge because complex contact dynamics demand high-precision force feedback and control. Although recent efforts have integrated force/torque sensing into policies, how to build a simple yet effective framework that generalizes robustly under multimodal observations remains an open question. In this paper, we propose ForceFlow, a force-aware reactive framework built upon flow matching. For contact-stage policy design, we investigate force signal fusion mechanisms and adopt an asymmetric multimodal fusion architecture that treats force as a global regulatory signal. We combine it with a joint prediction paradigm that enhances the policy's understanding of instantaneous force and historical information, thereby tightly coupling force and motion. For task-level hierarchical decomposition, we divide manipulation into a vision-dominant approach stage (VLM-based pointing for target localization) and a touch-dominant interaction stage (force-driven contact execution), with a Vision-to-Force (V2F) handover mechanism that explicitly decouples spatial generalization from contact regulation. Experimental results across six real-world contact-rich tasks show that ForceFlow improves success rate by 37% over the strong baseline ForceVLA while maintaining significantly lower cost. Moreover, ForceFlow predicts force signals accurately and demonstrates superior contact force self-regulation and zero-shot out-of-distribution (OOD) generalization.
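To make "flow matching with force as a global regulatory signal" concrete, here is a minimal NumPy sketch. It assumes a linear interpolation path between noise and the expert action, and a FiLM-style scale-and-shift as the modulation. The abstract specifies neither, so the network shape, the dimensions, and every function name (`init_params`, `velocity`, `flow_matching_loss`, `sample_action`) are hypothetical stand-ins rather than the paper's architecture.

```python
import numpy as np

# Assumed toy dimensions: action, visual feature, force/torque, hidden width.
ACTION_DIM, VIS_DIM, FORCE_DIM, HIDDEN = 7, 16, 6, 32


def init_params(rng):
    """Random weights for a one-hidden-layer velocity network (illustrative)."""
    w_in = 0.1 * rng.standard_normal((ACTION_DIM + VIS_DIM + 1, HIDDEN))
    w_gamma = 0.1 * rng.standard_normal((FORCE_DIM, HIDDEN))
    w_beta = 0.1 * rng.standard_normal((FORCE_DIM, HIDDEN))
    w_out = 0.1 * rng.standard_normal((HIDDEN, ACTION_DIM))
    return w_in, w_gamma, w_beta, w_out


def velocity(x_t, t, vis_feat, force_feat, params):
    """Asymmetric fusion (assumed form): vision enters the trunk as an input,
    while force only rescales and shifts the hidden units globally."""
    w_in, w_gamma, w_beta, w_out = params
    h = np.tanh(np.concatenate([x_t, vis_feat, [t]]) @ w_in)
    gamma = 1.0 + force_feat @ w_gamma   # per-unit scale driven by force
    beta = force_feat @ w_beta           # per-unit shift driven by force
    return (gamma * h + beta) @ w_out


def flow_matching_loss(action, vis_feat, force_feat, params, rng):
    """Regress the network onto the constant velocity of a straight path."""
    x0 = rng.standard_normal(action.shape)   # noise endpoint
    t = rng.uniform()                        # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * action        # point on the straight path
    target = action - x0                     # velocity of that path
    pred = velocity(x_t, t, vis_feat, force_feat, params)
    return float(np.mean((pred - target) ** 2))


def sample_action(vis_feat, force_feat, params, rng, steps=10):
    """Integrate the learned velocity field from noise with Euler steps."""
    x = rng.standard_normal(ACTION_DIM)
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity(x, k * dt, vis_feat, force_feat, params)
    return x


rng = np.random.default_rng(0)
params = init_params(rng)
vis = rng.standard_normal(VIS_DIM)
force = rng.standard_normal(FORCE_DIM)
expert_action = rng.standard_normal(ACTION_DIM)

loss = flow_matching_loss(expert_action, vis, force, params, rng)
action = sample_action(vis, force, params, rng)
```

With zero force input the modulation reduces to the identity (`gamma = 1`, `beta = 0`), so in this sketch force reshapes the vision-conditioned trunk instead of competing with visual features as another concatenated input. A joint prediction variant could extend the regression target to include future force readings alongside the action, though the abstract does not give that formulation.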