🤖 AI Summary
To address the lack of phase adaptability in vision–tactile fusion for dexterous manipulation, this paper proposes a force-guided adaptive vision–tactile fusion framework. Without requiring manual annotations, it introduces a force-driven predictive attention mechanism that dynamically modulates modality weights to match stage-specific perceptual demands; in addition, a self-supervised future force prediction task is designed to strengthen tactile representation learning. Key contributions include: (1) the first force-driven, temporally adaptive multimodal attention mechanism; (2) the first tactile-enhanced fusion architecture integrating self-supervised force prediction; and (3) modality weights that autonomously adapt to the current manipulation phase. Evaluated on three contact-rich, fine-grained manipulation tasks in real-world experiments, the framework achieves a mean success rate of 93%, demonstrating both the effectiveness and the soundness of the proposed approach.
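The core idea of force-guided attention can be pictured as a small gating network that reads the current force signal and reweights the visual and tactile embeddings before fusion. The PyTorch sketch below is purely illustrative; the module name `ForceGuidedFusion`, the feature and force dimensions, and the two-way softmax gating are assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (assumed design, not the paper's code): a force-guided
# attention module that maps the current force reading to per-modality weights.
import torch
import torch.nn as nn

class ForceGuidedFusion(nn.Module):
    def __init__(self, feat_dim: int = 256, force_dim: int = 6):
        super().__init__()
        # Small MLP mapping the force signal to two attention logits
        # (one for vision, one for touch).
        self.force_to_weights = nn.Sequential(
            nn.Linear(force_dim, 64), nn.ReLU(), nn.Linear(64, 2)
        )
        self.out = nn.Linear(feat_dim, feat_dim)

    def forward(self, vis_feat: torch.Tensor, tac_feat: torch.Tensor,
                force: torch.Tensor):
        # vis_feat, tac_feat: (B, feat_dim); force: (B, force_dim)
        weights = torch.softmax(self.force_to_weights(force), dim=-1)  # (B, 2)
        # Weighted sum of the two modality embeddings.
        fused = weights[:, 0:1] * vis_feat + weights[:, 1:2] * tac_feat
        # Returning the weights lets one inspect how attention shifts by stage.
        return self.out(fused), weights
```

Under such a design, near-zero forces during free-space reaching would push attention toward vision, while rising contact forces would shift it toward touch, which is consistent with the stage-specific behavior the summary describes.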
📝 Abstract
Effectively utilizing multi-sensory data is important for robots to generalize across diverse tasks. However, the heterogeneous nature of these modalities makes fusion challenging. Existing methods propose strategies to obtain comprehensively fused features but often ignore the fact that each modality requires different levels of attention at different manipulation stages. To address this, we propose a force-guided attention fusion module that adaptively adjusts the weights of visual and tactile features without human labeling. We also introduce a self-supervised future force prediction auxiliary task to reinforce the tactile modality, mitigate data imbalance, and encourage proper attention adjustment. Our method achieves an average success rate of 93% across three fine-grained, contact-rich tasks in real-world experiments. Further analysis shows that our policy appropriately adjusts its attention to each modality at different manipulation stages. The videos can be viewed at https://adaptac-dex.github.io/.
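The self-supervised auxiliary task needs no manual labels because the prediction targets are simply force readings recorded later in the same trajectory. The sketch below shows one plausible form of such an auxiliary head; the class name `FutureForcePredictor`, the prediction horizon, and all dimensions are assumptions for illustration only.

```python
# Illustrative sketch (assumed design): an auxiliary head that predicts a short
# horizon of future force readings from the tactile embedding, trained with a
# regression loss added to the main policy objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FutureForcePredictor(nn.Module):
    def __init__(self, feat_dim: int = 256, force_dim: int = 6, horizon: int = 5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, horizon * force_dim)
        )
        self.horizon, self.force_dim = horizon, force_dim

    def forward(self, tac_feat: torch.Tensor) -> torch.Tensor:
        # tac_feat: (B, feat_dim) tactile embedding from the policy encoder.
        pred = self.head(tac_feat)
        return pred.view(-1, self.horizon, self.force_dim)

# Training-time usage (hypothetical): the targets come from the logged data
# stream, so no human annotation is required.
#   aux_loss = F.mse_loss(predictor(tac_feat), future_forces)
#   total_loss = policy_loss + aux_weight * aux_loss
```

An auxiliary loss of this kind pressures the tactile encoder to capture contact dynamics even when tactile signals are sparse in the demonstrations, which is the role the abstract attributes to the future force prediction task.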