RotVLA: Rotational Latent Action for Vision-Language-Action Model

📅 2026-05-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses limitations of existing vision-language-action (VLA) models that rely on discrete latent action representations, which often lead to trivial frame reconstruction, restricted expressiveness, and a lack of physical structure. To overcome these issues, the authors propose a VLA framework that models latent actions as elements of the special orthogonal group SO(n), giving the latent space a continuous, compositional, and geometrically structured form. Coupled with a flow-matching action head, this formulation unifies high-level action planning and low-level execution. A triplet-frame learning mechanism further enforces semantically meaningful, non-degenerate temporal dynamics. The model scores 98.2% on LIBERO and 89.6% / 88.5% on RoboTwin2.0 under clean and randomized settings, respectively, and is reported to outperform existing VLA models on real-world manipulation tasks.
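To make "continuous and compositional" concrete, here is a minimal sketch of rotations in SO(n). It assumes a Cayley-transform parameterization, which is one standard way to map unconstrained parameters to a rotation matrix; this summary does not say how RotVLA actually parameterizes its latent actions, and the helper names (`skew`, `cayley`, `is_rotation`) are hypothetical.

```python
import numpy as np


def skew(params: np.ndarray, n: int) -> np.ndarray:
    """Build an n x n skew-symmetric matrix from n*(n-1)/2 free parameters."""
    A = np.zeros((n, n))
    A[np.triu_indices(n, k=1)] = params
    return A - A.T


def cayley(A: np.ndarray) -> np.ndarray:
    """Cayley transform: maps a skew-symmetric A to (I - A)^{-1} (I + A) in SO(n).

    This is an illustrative assumption, not the paper's stated parameterization.
    """
    I = np.eye(A.shape[0])
    return np.linalg.solve(I - A, I + A)


def is_rotation(R: np.ndarray, tol: float = 1e-8) -> bool:
    """Check orthogonality (R^T R = I) and unit determinant."""
    n = R.shape[0]
    orthogonal = np.allclose(R.T @ R, np.eye(n), atol=tol)
    return bool(orthogonal and np.isclose(np.linalg.det(R), 1.0, atol=tol))


n = 4
rng = np.random.default_rng(0)
num_params = n * (n - 1) // 2

# Two hypothetical latent actions: frame a -> b and frame b -> c.
R_ab = cayley(skew(rng.normal(size=num_params), n))
R_bc = cayley(skew(rng.normal(size=num_params), n))

# Compositionality: chaining two rotations gives another rotation (a -> c).
R_ac = R_bc @ R_ab

# Invertibility: the transpose undoes the action.
identity_again = R_ab.T @ R_ab

print(is_rotation(R_ab), is_rotation(R_ac))
```

Because SO(n) is a group, composition and inversion never leave the space, which is the property a discrete codebook lacks. The triplet of frames a, b, c is used here only to illustrate composition and is not a reconstruction of the paper's training objective.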
📝 Abstract
Latent Action Models (LAMs) have emerged as an effective paradigm for handling heterogeneous datasets during Vision-Language-Action (VLA) model pretraining, offering a unified action space across embodiments. However, existing LAMs often rely on discrete quantization in their encode-decode pipelines, which can lead to trivial frame reconstruction behavior, limited representational capacity, and a lack of physically meaningful structure. We introduce RotVLA, a VLA framework built on a continuous rotational latent action representation. Latent actions are modeled as elements of SO(n), providing continuity, compositionality, and structured geometry aligned with real-world action dynamics. A triplet-frame learning framework further enforces meaningful temporal dynamics while avoiding degeneration. RotVLA consists of a VLM backbone and a flow-matching action head, pretrained on large-scale cross-embodiment robotic datasets and human videos with latent-action supervision. For downstream robot control, the flow-matching head is extended into a unified action expert that jointly denoises latent and robot actions. Here, latent actions serve as a latent planner, providing high-level guidance that conditions action generation. With only 1.7B parameters and 1700+ hours of pretraining data, RotVLA achieves 98.2% on LIBERO and 89.6% / 88.5% on RoboTwin2.0 under clean and randomized settings, respectively. It also demonstrates strong real-world performance on manipulation tasks, consistently outperforming existing VLA models.
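The idea of a flow-matching head that jointly denoises latent and robot actions can be sketched with the standard linear-path flow-matching recipe. Everything here is an assumption for illustration: the vector sizes, the concatenation of latent and robot actions into one vector, and the oracle velocity field standing in for a trained network. It is not the RotVLA architecture.

```python
import numpy as np


def interpolate(x0: np.ndarray, x1: np.ndarray, t: float) -> np.ndarray:
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0: np.ndarray, x1: np.ndarray) -> np.ndarray:
    """Regression target for the velocity network along the linear path."""
    return x1 - x0


def euler_sample(velocity_fn, x0: np.ndarray, steps: int = 10) -> np.ndarray:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


rng = np.random.default_rng(0)

# Hypothetical sizes: 6 latent-action parameters and a 7-DoF robot action.
latent_action = rng.normal(size=6)
robot_action = rng.normal(size=7)

# Joint denoising target: both parts live in one vector (an assumption).
x1 = np.concatenate([latent_action, robot_action])
x0 = rng.normal(size=x1.shape)  # starting noise

# Oracle velocity field standing in for a trained network.
oracle = lambda x, t: target_velocity(x0, x1)

sample = euler_sample(oracle, x0, steps=10)
latent_part, robot_part = sample[:6], sample[6:]
```

In training, a network would be fit to `target_velocity(x0, x1)` at points `interpolate(x0, x1, t)` for random `t`; at inference the same Euler loop runs with the learned field in place of `oracle`.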
Problem

Research questions and friction points this paper is trying to address.

Latent Action Models
Vision-Language-Action
discrete quantization
representational capacity
physically meaningful structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

rotational latent action
SO(n) representation
flow-matching
Vision-Language-Action model
triplet frame learning