CLASS: Contrastive Learning via Action Sequence Supervision for Robot Manipulation

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Behavior cloning (BC) generalizes poorly on heterogeneous demonstration data, such as data with varying camera poses or object appearances, because it tends to overfit individual demonstrations instead of capturing the structure they share. The paper proposes CLASS, a supervised contrastive learning framework that takes its supervision from action sequences: Dynamic Time Warping (DTW) identifies similar action sequences, which serve as weak supervision, and a soft InfoNCE loss with similarity-weighted positive pairs trains the representation. CLASS is evaluated on five simulation benchmarks and three real-world tasks, where retrieval-based control using the learned representations alone achieves competitive results. For downstream policy learning under significant visual shifts, Diffusion Policy with CLASS pre-training reaches an average success rate of 75%, while all other baselines fail to perform competitively.
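
To make the DTW step concrete, here is a minimal sketch of the classic dynamic-programming DTW distance between two action sequences. It is an illustration under stated assumptions, not the paper's implementation: the function name `dtw_distance`, the Euclidean per-step cost, and the `(T, action_dim)` array layout are all hypothetical choices.

```python
import numpy as np


def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) DTW between two action sequences.

    Assumption: each sequence has shape (T, action_dim) and the per-step
    cost is the Euclidean distance between action vectors.
    """
    n, m = len(a), len(b)
    # cost[i, j] = distance between action i of `a` and action j of `b`
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

    # acc[i, j] = minimal accumulated cost aligning a[:i] with b[:j]
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    return float(acc[n, m])


# Two sequences that differ only in timing align with zero cost.
slow = np.array([[0.0], [0.0], [1.0], [2.0]])
fast = np.array([[0.0], [1.0], [2.0]])
print(dtw_distance(slow, fast))
```

A small distance marks two action sequences as similar even when they unfold at different speeds, which is the property that lets DTW supply supervision without relying on the images themselves.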

📝 Abstract
Recent advances in Behavior Cloning (BC) have led to strong performance in robotic manipulation, driven by expressive models, sequence modeling of actions, and large-scale demonstration data. However, BC faces significant challenges when applied to heterogeneous datasets, such as visual shift with different camera poses or object appearances, where performance degrades despite the benefits of learning at scale. This stems from BC's tendency to overfit individual demonstrations rather than capture shared structure, limiting generalization. To address this, we introduce Contrastive Learning via Action Sequence Supervision (CLASS), a method for learning behavioral representations from demonstrations using supervised contrastive learning. CLASS leverages weak supervision from similar action sequences identified via Dynamic Time Warping (DTW) and optimizes a soft InfoNCE loss with similarity-weighted positive pairs. We evaluate CLASS on 5 simulation benchmarks and 3 real-world tasks, where it achieves competitive results using retrieval-based control with representations only. Most notably, for downstream policy learning under significant visual shifts, Diffusion Policy with CLASS pre-training achieves an average success rate of 75%, while all other baseline methods fail to perform competitively. Project webpage: https://class-robot.github.io.
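
The abstract does not spell out how retrieval-based control works, so the following is only one plausible sketch: embed the current observation, find its nearest neighbors among the demonstration embeddings, and blend their stored action chunks. The function `retrieve_action`, the cosine-similarity metric, the softmax weighting, and the parameters `k` and `temperature` are assumptions made for illustration.

```python
import numpy as np


def retrieve_action(query_emb, bank_embs, bank_actions, k=3, temperature=0.1):
    """Nearest-neighbor action retrieval in embedding space (hypothetical).

    query_emb:    (D,)      embedding of the current observation
    bank_embs:    (N, D)    embeddings of demonstration observations
    bank_actions: (N, H, A) action chunk stored with each demonstration frame
    Returns an (H, A) action chunk: a softmax-weighted mean of the top-k.
    """
    # Cosine similarity between the query and every stored embedding.
    q = query_emb / np.linalg.norm(query_emb)
    bank = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    sims = bank @ q

    # Indices of the k most similar demonstration frames.
    top = np.argsort(-sims)[:k]

    # Numerically stable softmax over the selected similarities.
    logits = sims[top] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    # Weighted average of the retrieved action chunks.
    return np.tensordot(weights, bank_actions[top], axes=1)
```

No policy network is trained in this setup, so the quality of the retrieved actions depends entirely on the representation, which is why it serves as a direct test of the learned embeddings.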

Problem

Research questions and friction points this paper is trying to address.

Addresses overfitting in Behavior Cloning for robotic manipulation
Improves generalization across heterogeneous datasets with visual shifts
Enhances policy learning under significant visual variations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses supervised contrastive learning for behavioral representations
Leverages Dynamic Time Warping for weak supervision
Optimizes soft InfoNCE loss with similarity-weighted pairs
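
As a rough illustration of the last point, the sketch below implements one common form of a soft InfoNCE loss in which each anchor's positives carry normalized similarity weights. The exact formulation in the paper may differ; the function name `soft_info_nce`, the `weights` matrix (imagined here as derived from DTW similarity), and the temperature value are assumptions.

```python
import numpy as np


def soft_info_nce(z, weights, temperature=0.1):
    """Soft InfoNCE with similarity-weighted positives (illustrative form).

    z:       (N, D) embeddings, one per sample
    weights: (N, N) non-negative weights; weights[i, j] > 0 marks j as a
             positive for anchor i (hypothetically from DTW similarity)
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    np.fill_diagonal(logits, -np.inf)  # a sample is never its own pair

    # Row-wise log-softmax over all other samples.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    np.fill_diagonal(log_prob, 0.0)  # avoid 0 * -inf below

    # Normalize each anchor's positive weights so they sum to one.
    w = weights.astype(float).copy()
    np.fill_diagonal(w, 0.0)
    row_sum = w.sum(axis=1)
    has_pos = row_sum > 0
    if not has_pos.any():
        return 0.0
    w[has_pos] = w[has_pos] / row_sum[has_pos][:, None]

    # Weighted cross-entropy, averaged over anchors that have positives.
    per_anchor = -(w * log_prob).sum(axis=1)
    return float(per_anchor[has_pos].mean())
```

With binary weights this reduces to a standard supervised contrastive loss; graded weights let pairs with more similar action sequences pull their embeddings together more strongly.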