UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

📅 2026-02-23
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a limitation of existing RGB-based latent action representations: they lack explicit 3D geometric structure and therefore struggle to support precise, contact-rich robotic manipulation. The authors propose UniLARN, a unified latent action learning framework that learns a shared embedding space for RGB and depth through inverse and forward dynamics objectives while explicitly modeling cross-modal interactions, yielding both modality-specific and unified latent action representations. These representations serve as pseudo-labels for the depth-aware pretraining of UniLACT, a Transformer-based vision-language-action (VLA) model. In simulation and real-world experiments, UniLACT consistently outperforms RGB-based latent action baselines under in-domain and out-of-domain pretraining regimes and on both seen and unseen manipulation tasks.

📝 Abstract
Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This formulation produces modality-specific and unified latent action representations that serve as pseudo-labels for the depth-aware pretraining of UniLACT. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness of depth-aware unified latent action representations. UniLACT consistently outperforms RGB-based latent action baselines under in-domain and out-of-domain pretraining regimes, as well as on both seen and unseen manipulation tasks.
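
To make the inverse/forward dynamics idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes linear encoders in place of the transformer, simple concatenation of same-width RGB and depth embeddings in place of learned cross-modal interaction, and a nearest-neighbor codebook that turns latent actions into discrete pseudo-labels. The abstract specifies none of these details, and all names, sizes, and parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: flattened RGB/depth features, embedding, action, codebook.
D_RGB, D_DEPTH, D_EMB, D_ACT, N_CODES = 48, 16, 8, 4, 16

# Hypothetical linear stand-ins for the learned encoders and dynamics heads.
params = {
    "enc_rgb": rng.normal(0.0, 0.1, (D_RGB, D_EMB)),
    "enc_depth": rng.normal(0.0, 0.1, (D_DEPTH, D_EMB)),
    "idm": rng.normal(0.0, 0.1, (4 * D_EMB, D_ACT)),
    "fdm": rng.normal(0.0, 0.1, (2 * D_EMB + D_ACT, 2 * D_EMB)),
    "codebook": rng.normal(0.0, 1.0, (N_CODES, D_ACT)),
}

def embed(rgb, depth, p):
    """Project each modality to the same width, then concatenate them."""
    return np.concatenate([rgb @ p["enc_rgb"], depth @ p["enc_depth"]], axis=-1)

def inverse_dynamics(z_t, z_next, p):
    """Infer a continuous latent action from two consecutive embeddings."""
    return np.concatenate([z_t, z_next], axis=-1) @ p["idm"]

def quantize(action, p):
    """Snap each action to its nearest codebook entry (assumed discretization)."""
    dists = ((action[:, None, :] - p["codebook"][None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, p["codebook"][idx]

def forward_dynamics(z_t, action, p):
    """Predict the next embedding from the current one and the latent action."""
    return np.concatenate([z_t, action], axis=-1) @ p["fdm"]

def latent_action_loss(rgb_t, depth_t, rgb_next, depth_next, p):
    """Forward-prediction error plus the discrete pseudo-labels for the batch."""
    z_t = embed(rgb_t, depth_t, p)
    z_next = embed(rgb_next, depth_next, p)
    idx, action_q = quantize(inverse_dynamics(z_t, z_next, p), p)
    pred_next = forward_dynamics(z_t, action_q, p)
    return float(np.mean((pred_next - z_next) ** 2)), idx

# Random arrays stand in for consecutive unlabeled video frames.
batch = 5
rgb_t, rgb_next = rng.normal(size=(2, batch, D_RGB))
depth_t, depth_next = rng.normal(size=(2, batch, D_DEPTH))
loss, pseudo_labels = latent_action_loss(rgb_t, depth_t, rgb_next, depth_next, params)
```

In a real training loop the loss would be minimized with respect to the encoder and dynamics parameters. Under this sketch's assumptions, the resulting `pseudo_labels` would then play the role of targets for pretraining the downstream policy.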
Problem

Research questions and friction points this paper is trying to address.

latent action
vision-language-action models
depth-aware
3D geometric structure
robot manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

depth-aware latent actions
vision-language-action models
unified latent action learning
cross-modal interaction
geometric structure