Contrastive Representation Regularization for Vision-Language-Action Models

📅 2025-10-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing Vision-Language-Action (VLA) models show limited sensitivity to robot proprioceptive states—such as joint angles and end-effector poses—and control signals, which degrades precision in action localization and execution. To address this, we propose Robot State-aware Contrastive Loss (RS-CL), which uses relative distances among proprioceptive states as soft supervision to explicitly align action-relevant representations with robot states within the vision-language embedding space. RS-CL is a lightweight, plug-and-play representation regularizer: it complements the standard action prediction objective without modifying the backbone architecture or retraining the base vision-language model. On pick-and-place tasks in RoboCasa-Kitchen, it raises the task success rate from 30.8% to 41.5% through more accurate positioning during grasping and placing, and on challenging real-robot manipulation tasks it boosts success from 45.0% to 58.3%.
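The core idea—contrastive targets softened by pairwise proprioceptive-state distances—can be illustrated with a minimal numpy sketch. Everything here (function name, cosine similarity, the `temperature` and `state_scale` hyperparameters) is an illustrative assumption, not a detail taken from the paper:

```python
import numpy as np

def rs_cl_loss(embeddings, states, temperature=0.1, state_scale=1.0):
    """Sketch of a robot-state-aware contrastive loss with soft targets.

    `temperature` and `state_scale` are illustrative hyperparameters,
    not values from the paper.
    """
    n = embeddings.shape[0]
    mask = ~np.eye(n, dtype=bool)  # exclude self-pairs

    def masked_softmax(x):
        x = np.where(mask, x, -np.inf)
        x = x - x.max(axis=1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=1, keepdims=True)

    # Temperature-scaled cosine similarity between batch embeddings.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    logits = (z @ z.T) / temperature

    # Soft supervision: pairs whose proprioceptive states are closer
    # receive higher target probability than distant pairs.
    dist = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    targets = masked_softmax(-state_scale * dist)

    # Cross-entropy between the state-derived soft targets and the
    # embedding-similarity distribution, averaged over the batch.
    log_probs = np.log(masked_softmax(logits) + 1e-12)
    return float(-(targets * log_probs).sum(axis=1).mean())
```

Under this sketch, a batch whose embedding similarities agree with its state distances incurs a lower loss than one where they disagree, which is the sense in which the state distances act as soft supervision.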

📝 Abstract
Vision-Language-Action (VLA) models have shown their capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive states. To address this issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states, by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL effectively enhances control-relevant representation learning, while being lightweight and fully compatible with the standard VLA training pipeline. Our empirical results demonstrate that RS-CL substantially improves the manipulation performance of state-of-the-art VLA models; it pushes the prior art from 30.8% to 41.5% on pick-and-place tasks in RoboCasa-Kitchen, through more accurate positioning during grasping and placing, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.
Problem

Research questions and friction points this paper is trying to address.

VLA models inherit rich representations from pre-trained VLMs, but these lack sensitivity to robotic signals such as control actions and proprioceptive states
Insensitive representations lead to imprecise positioning during grasping and placing in manipulation tasks
Enhancing control-relevant representation learning without retraining the VLM backbone or altering the standard VLA training pipeline
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Robot State-aware Contrastive Loss (RS-CL), a representation regularization for VLA models
Uses relative distances between proprioceptive states as soft supervision to align representations with robot states
Lightweight and plug-and-play: complements the action prediction objective without modifying the backbone architecture