🤖 AI Summary
To address insufficient local geometric awareness in contact-rich manipulation tasks, this paper proposes the Spatially Adaptive Representation Learning (SARL) framework, presented as the first self-supervised representation learning method to achieve spatial equivariance for fused vision-tactile imagery. Unlike mainstream self-supervised learning (SSL) approaches that collapse features into global vectors and discard spatial structure, SARL extends the BYOL architecture by imposing three explicit spatial constraints on intermediate feature maps: saliency alignment, patch-wise prototype distribution alignment, and region-wise affinity matching, thereby preserving geometric consistency. Evaluated on six downstream manipulation tasks, SARL consistently outperforms nine state-of-the-art baselines. Notably, on edge pose regression, it achieves a mean absolute error (MAE) of 0.3955, a 30% improvement over the best prior SSL method, approaching the performance upper bound of fully supervised learning.
📝 Abstract
Contact-rich robotic manipulation requires representations that encode local geometry. Vision provides global context but lacks direct measurements of properties such as texture and hardness, whereas touch supplies these cues. Modern visuo-tactile sensors capture both modalities in a single fused image, yielding intrinsically aligned inputs that are well suited to manipulation tasks requiring visual and tactile information. Most self-supervised learning (SSL) frameworks, however, compress feature maps into a global vector, discarding spatial structure and misaligning with the needs of manipulation. To address this, we propose SARL, a spatially aware SSL framework that augments the Bootstrap Your Own Latent (BYOL) architecture with three map-level objectives: Saliency Alignment (SAL), Patch-Prototype Distribution Alignment (PPDA), and Region Affinity Matching (RAM), which keep attentional focus, part composition, and geometric relations consistent across views. These losses act on intermediate feature maps, complementing the global objective. SARL consistently outperforms nine SSL baselines across six downstream tasks with fused visual-tactile data. On the geometry-sensitive edge-pose regression task, SARL achieves a Mean Absolute Error (MAE) of 0.3955, a 30% relative improvement over the next-best SSL method (0.5682 MAE), approaching the supervised upper bound. These findings indicate that, for fused visual-tactile data, the most effective signal is structured spatial equivariance, in which features vary predictably with object geometry, enabling more capable robotic perception.
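To make the three map-level objectives concrete, here is a minimal NumPy sketch of what loss terms of this kind might look like when applied to two feature maps from augmented views. The specific choices below (channel-norm saliency, a random prototype matrix, KL divergence for distribution alignment, cosine affinity between spatial locations) are illustrative assumptions for exposition, not the paper's exact formulations.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def saliency_alignment(f1, f2):
    # SAL sketch: channel-wise L2 norm yields a per-view saliency map;
    # penalize the squared difference of the normalized maps.
    s1 = np.linalg.norm(f1, axis=0); s1 /= s1.sum()
    s2 = np.linalg.norm(f2, axis=0); s2 /= s2.sum()
    return float(((s1 - s2) ** 2).sum())

def ppda(f1, f2, prototypes):
    # PPDA sketch: soft-assign each spatial patch to K prototypes, then
    # match the views' aggregate assignment distributions via KL divergence.
    C = f1.shape[0]
    p1 = softmax(prototypes @ f1.reshape(C, -1), axis=0).mean(axis=1)
    p2 = softmax(prototypes @ f2.reshape(C, -1), axis=0).mean(axis=1)
    return float((p1 * np.log(p1 / p2)).sum())

def ram(f1, f2):
    # RAM sketch: cosine affinity between all spatial locations within each
    # view; penalize the squared difference of the two affinity matrices.
    def affinity(f):
        x = f.reshape(f.shape[0], -1)
        x = x / np.linalg.norm(x, axis=0, keepdims=True)
        return x.T @ x
    a1, a2 = affinity(f1), affinity(f2)
    return float(((a1 - a2) ** 2).mean())
```

Identical views drive all three terms to zero, so each loss rewards spatial consistency rather than the view-invariant global collapse of a standard BYOL objective.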