🤖 AI Summary
To address insufficient local geometric awareness in contact-rich manipulation tasks, this paper proposes the Spatially Adaptive Representation Learning (SARL) framework, presented as the first self-supervised representation learning method to achieve spatial equivariance for fused vision-tactile imagery. Unlike mainstream self-supervised learning (SSL) approaches that collapse features into global vectors and discard spatial structure, SARL extends the BYOL architecture by imposing three explicit spatial constraints on intermediate feature maps: saliency alignment, patch-wise prototype distribution alignment, and region-wise affinity matching, thereby preserving geometric consistency. Evaluated on six downstream manipulation tasks, SARL consistently outperforms nine state-of-the-art baselines. Notably, on edge pose regression, it achieves a mean absolute error (MAE) of 0.3955, a 30% improvement over the best prior SSL method, approaching the performance upper bound of fully supervised learning.
📝 Abstract
Contact-rich robotic manipulation requires representations that encode local geometry. Vision provides global context but lacks direct measurements of properties such as texture and hardness, whereas touch supplies these cues. Modern visuo-tactile sensors capture both modalities in a single fused image, yielding intrinsically aligned inputs that are well suited to manipulation tasks requiring visual and tactile information. Most self-supervised learning (SSL) frameworks, however, compress feature maps into a global vector, discarding spatial structure and misaligning with the needs of manipulation. To address this, we propose SARL, a spatially aware SSL framework that augments the Bootstrap Your Own Latent (BYOL) architecture with three map-level objectives: Saliency Alignment (SAL), Patch-Prototype Distribution Alignment (PPDA), and Region Affinity Matching (RAM), which keep attentional focus, part composition, and geometric relations consistent across views. These losses act on intermediate feature maps, complementing the global objective. SARL consistently outperforms nine SSL baselines across six downstream tasks with fused visual-tactile data. On the geometry-sensitive edge-pose regression task, SARL achieves a Mean Absolute Error (MAE) of 0.3955, a 30% relative improvement over the next-best SSL method (0.5682 MAE), approaching the supervised upper bound. These findings indicate that, for fused visual-tactile data, the most effective signal is structured spatial equivariance, in which features vary predictably with object geometry, enabling more capable robotic perception.
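To make the three map-level objectives concrete, here is a minimal NumPy sketch of what loss terms of this kind might look like when applied to two feature maps from augmented views. The specific choices below (channel-norm saliency, a random prototype matrix, KL divergence for distribution alignment, cosine affinity between spatial locations) are illustrative assumptions for exposition, not the paper's exact formulations.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def saliency_alignment(f1, f2):
    # SAL sketch: channel-wise L2 norm yields a per-view saliency map;
    # penalize the squared difference of the normalized maps.
    s1 = np.linalg.norm(f1, axis=0); s1 /= s1.sum()
    s2 = np.linalg.norm(f2, axis=0); s2 /= s2.sum()
    return float(((s1 - s2) ** 2).sum())

def ppda(f1, f2, prototypes):
    # PPDA sketch: soft-assign each spatial patch to K prototypes, then
    # match the views' aggregate assignment distributions via KL divergence.
    C = f1.shape[0]
    p1 = softmax(prototypes @ f1.reshape(C, -1), axis=0).mean(axis=1)
    p2 = softmax(prototypes @ f2.reshape(C, -1), axis=0).mean(axis=1)
    return float((p1 * np.log(p1 / p2)).sum())

def ram(f1, f2):
    # RAM sketch: cosine affinity between all spatial locations within each
    # view; penalize the squared difference of the two affinity matrices.
    def affinity(f):
        x = f.reshape(f.shape[0], -1)
        x = x / np.linalg.norm(x, axis=0, keepdims=True)
        return x.T @ x
    a1, a2 = affinity(f1), affinity(f2)
    return float(((a1 - a2) ** 2).mean())
```

Identical views drive all three terms to zero, so each loss rewards spatial consistency rather than the view-invariant global collapse of a standard BYOL objective.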