UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

📅 2026-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing RGB-based latent action representations: they lack explicit 3D geometric structure and thus struggle to support precise, contact-rich robotic manipulation. To overcome this, the authors propose UniLACT, a transformer-based vision-language-action model pretrained on depth-aware latent actions, together with UniLARN, a unified latent action learning framework that integrates RGB and depth modalities. UniLARN jointly models cross-modal interactions and learns a shared embedding space through inverse and forward dynamics objectives, yielding both modality-specific and unified latent action representations that serve as pseudo-labels for UniLACT's pretraining. Experimental results demonstrate that the proposed method significantly outperforms RGB-only baselines in both simulation and real-world settings, exhibiting stronger spatial understanding and generalization across in-domain and out-of-domain pretraining data as well as seen and unseen manipulation tasks.
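
To make the UniLARN recipe concrete, below is a minimal PyTorch sketch of latent action learning with inverse and forward dynamics objectives over paired RGB and depth frames. Everything here is an illustrative assumption rather than the authors' implementation: the module names, dimensions, the query-token readout, and the use of a continuous (non-quantized) latent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    """Hypothetical UniLARN-style model: infers latent actions from an
    (o_t, o_t+1) pair (inverse dynamics) and is trained by predicting the
    next frame's tokens from frame t plus the latent (forward dynamics)."""

    def __init__(self, dim=256, patch=16, img=128):
        super().__init__()
        # Modality-specific patch encoders (RGB: 3 channels, depth: 1).
        self.rgb_enc = nn.Conv2d(3, dim, patch, stride=patch)
        self.depth_enc = nn.Conv2d(1, dim, patch, stride=patch)
        # Shared transformer models cross-modal interactions across the
        # concatenated RGB/depth tokens of both frames.
        layer = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Learnable queries read out RGB-specific, depth-specific,
        # and unified latent actions from the shared embedding space.
        self.act_query = nn.Parameter(torch.randn(1, 3, dim))
        self.fwd = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 4, batch_first=True), num_layers=2)

    def tokens(self, rgb, depth):
        r = self.rgb_enc(rgb).flatten(2).transpose(1, 2)      # (B, N, dim)
        d = self.depth_enc(depth).flatten(2).transpose(1, 2)  # (B, N, dim)
        return r, d

    def forward(self, rgb_t, depth_t, rgb_t1, depth_t1):
        r_t, d_t = self.tokens(rgb_t, depth_t)
        r_t1, d_t1 = self.tokens(rgb_t1, depth_t1)
        # Inverse dynamics: read latent actions from the frame pair.
        q = self.act_query.expand(rgb_t.shape[0], -1, -1)
        out = self.backbone(torch.cat([q, r_t, d_t, r_t1, d_t1], dim=1))
        z_rgb, z_depth, z_uni = out[:, 0], out[:, 1], out[:, 2]
        # Forward dynamics: predict next-frame tokens from frame t plus the
        # unified latent; this loss grounds the latents in scene dynamics.
        pred = self.fwd(torch.cat([z_uni.unsqueeze(1), r_t, d_t], dim=1))[:, 1:]
        target = torch.cat([r_t1, d_t1], dim=1).detach()
        return (z_rgb, z_depth, z_uni), F.mse_loss(pred, target)

A training step would encode two consecutive RGB-D frames, minimize the returned forward-dynamics loss, and later store the three latents offline as pseudo-labels.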

📝 Abstract
Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This formulation produces modality-specific and unified latent action representations that serve as pseudo-labels for the depth-aware pretraining of UniLACT. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness of depth-aware unified latent action representations. UniLACT consistently outperforms RGB-based latent action baselines under in-domain and out-of-domain pretraining regimes, as well as on both seen and unseen manipulation tasks.
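
The pseudo-label pretraining stage can likewise be sketched. Assuming the UniLARN latents above have been computed offline and frozen, a hypothetical regression objective trains the VLA backbone to predict them from observation and language tokens; the stub backbone, head design, and MSE loss below are illustrative choices, not details taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVLABackbone(nn.Module):
    """Stand-in for a transformer VLA backbone; pools fused tokens."""
    def __init__(self, dim=768):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, obs_tokens, lang_tokens):
        x = torch.cat([obs_tokens, lang_tokens], dim=1)
        return self.enc(x).mean(dim=1)  # (B, dim)

class LatentActionHead(nn.Module):
    """Maps pooled VLA features to the three latent-action pseudo-labels."""
    def __init__(self, feat_dim=768, latent_dim=256):
        super().__init__()
        self.proj = nn.ModuleDict(
            {k: nn.Linear(feat_dim, latent_dim) for k in ("rgb", "depth", "unified")})

    def forward(self, feats):
        return {k: proj(feats) for k, proj in self.proj.items()}

def pretrain_step(backbone, head, batch, optimizer):
    """One regression step on frozen UniLARN latents (the pseudo-labels)."""
    feats = backbone(batch["obs_tokens"], batch["lang_tokens"])
    preds = head(feats)
    loss = sum(F.mse_loss(preds[k], batch["latents"][k]) for k in preds)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random stand-in data:
backbone, head = TinyVLABackbone(), LatentActionHead()
opt = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-4)
batch = {
    "obs_tokens": torch.randn(2, 64, 768),
    "lang_tokens": torch.randn(2, 16, 768),
    "latents": {k: torch.randn(2, 256) for k in ("rgb", "depth", "unified")},
}
pretrain_step(backbone, head, batch, opt)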
Problem

Research questions and friction points this paper is trying to address.

latent action
vision-language-action models
depth-aware
3D geometric structure
robot manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

depth-aware latent actions
vision-language-action models
unified latent action learning
cross-modal interaction
geometric structure
👥 Authors
Manish Kumar Govind
Department of Computer Science, University of North Carolina at Charlotte, NC, USA
Dominick Reilly
UNC Charlotte
video understanding, multimodal learning
Pu Wang
University of North Carolina at Charlotte
Optimization, Machine Learning, Networked Systems
Srijan Das
Department of Computer Science, University of North Carolina at Charlotte, NC, USA