🤖 AI Summary
This work addresses the temporal mismatch between high-frequency tactile feedback and low-frequency visual planning by proposing a hierarchical fusion architecture that aligns rapid tactile reflexes with slower vision-language-action (VLA) reasoning. The core innovations include a plug-and-play high-frequency tactile interface, a Mamba-based state-space model serving as a tactile history compressor with O(1) inference latency, and a tactile-guided two-stage self-supervised training strategy that integrates temporal contrastive learning with phase-uniform sampling. Evaluated on button-press counting and latent-state switching tasks, the system achieves 100% success rates, significantly outperforming vision-only baselines, while meeting hard real-time constraints at 0.45 ms latency.
📝 Abstract
In visually ambiguous manipulation, such as detecting a button click, tactile feedback is often the sole source of ground truth. However, fusing tactile data poses a significant challenge due to a spatiotemporal mismatch: tactile perception requires high-frequency processing with long-horizon memory (System 1), whereas visual policies operate at low control frequencies (System 2). Existing architectures struggle to bridge this gap: Transformers are computationally prohibitive for high-frequency loops (>100 Hz), while LSTMs suffer from forgetting over extended interaction histories. In this paper, we introduce TacMamba, a hierarchical architecture that aligns high-bandwidth tactile reflexes with low-frequency visual planning. Our approach comprises three core contributions: (1) a custom high-frequency tactile interface designed for flexible integration; (2) a Mamba-based Tactile History Compressor that encodes continuous force history into a compact state with O(1) inference latency (0.45 ms), enabling plug-and-play fusion with VLA models without joint pre-training; and (3) a Tactile-Guided Dual-Stage Training strategy that leverages temporal discrimination for self-supervised representation learning and phase-uniform sampling to mitigate data sparsity. Experiments on discrete counting and implicit state switching demonstrate that TacMamba achieves 100% success rates, significantly outperforming the vision-only pi_0.5 baseline, while strictly satisfying hard real-time constraints.
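To make the O(1)-latency claim concrete, the sketch below shows the kind of linear state-space recurrence that Mamba-style models build on: each new tactile sample folds into a fixed-size hidden state in constant time, independent of history length. This is an illustrative toy, not the paper's implementation; all dimensions and matrices (`STATE_DIM`, `INPUT_DIM`, `A`, `B`, `C`) are arbitrary placeholders, and real Mamba additionally makes the dynamics input-dependent (selective).

```python
import numpy as np

STATE_DIM = 16   # size of the compressed tactile state (assumed)
INPUT_DIM = 4    # e.g. one force reading per taxel (assumed)

rng = np.random.default_rng(0)
A = 0.9 * np.eye(STATE_DIM)                             # state transition (decaying memory)
B = rng.normal(scale=0.1, size=(STATE_DIM, INPUT_DIM))  # input projection
C = rng.normal(scale=0.1, size=(1, STATE_DIM))          # readout to the slow policy

def step(h, x):
    """One O(1) recurrent update: fold a new tactile sample x into state h.

    Cost per step is fixed, unlike Transformer attention, which grows with
    the length of the tactile history it must attend over.
    """
    h_next = A @ h + B @ x
    y = C @ h_next          # compact summary exposed to the low-frequency VLA loop
    return h_next, y

# Stream 1000 high-frequency samples; memory stays one fixed-size vector.
h = np.zeros(STATE_DIM)
for _ in range(1000):
    x = rng.normal(size=INPUT_DIM)
    h, y = step(h, x)
```

Because the compressor's entire interaction history lives in `h`, the slow visual policy can read a single vector at its own control rate, which is what makes the plug-and-play fusion described above feasible.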