RotVLA: Rotational Latent Action for Vision-Language-Action Model

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses limitations in existing vision-language-action (VLA) models, which rely on discrete latent action representations that often lead to trivial frame reconstruction, restricted expressiveness, and a lack of physical structure. To overcome these issues, the authors propose a VLA framework that models latent actions as elements of the special orthogonal group SO(n), thereby introducing, for the first time, a continuous, compositional, and geometrically structured latent action space. Coupled with a flow-matching action head, this formulation unifies high-level action planning and low-level execution. A triplet-frame learning mechanism further enforces semantically meaningful, non-degenerate temporal dynamics. The model achieves state-of-the-art performance, scoring 98.2% on LIBERO and 89.6% and 88.5% on RoboTwin2.0 under clean and randomized settings, respectively, while consistently outperforming existing VLA models on real-world robotic manipulation tasks.
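To make the "continuous and compositional" claim concrete, here is a minimal NumPy sketch of the group properties of SO(n) that the summary refers to. It is an illustration under stated assumptions, not the paper's implementation: the Cayley transform is just one convenient way to map unconstrained parameters to a rotation, and the names `skew`, `cayley`, `is_rotation`, `R_ab`, `R_bc`, and `R_ac` are hypothetical. How RotVLA actually parameterizes its latent actions is not stated here.

```python
import numpy as np

def skew(params, n):
    """Build an n x n skew-symmetric matrix from n*(n-1)/2 free parameters."""
    A = np.zeros((n, n))
    A[np.triu_indices(n, k=1)] = params  # fill the strict upper triangle
    return A - A.T                       # enforce A^T = -A

def cayley(A):
    """Map a skew-symmetric matrix to SO(n): R = (I - A)^{-1} (I + A).

    Assumption: the Cayley transform stands in for whatever
    parameterization the paper uses; it is chosen for simplicity.
    """
    I = np.eye(A.shape[0])
    return np.linalg.solve(I - A, I + A)

def is_rotation(R, tol=1e-8):
    """Check R^T R = I and det(R) = +1."""
    I = np.eye(R.shape[0])
    orthogonal = np.allclose(R.T @ R, I, atol=tol)
    proper = np.isclose(np.linalg.det(R), 1.0, atol=tol)
    return bool(orthogonal and proper)

n = 4
dim = n * (n - 1) // 2                   # degrees of freedom of SO(n)
rng = np.random.default_rng(0)

# Hypothetical latent actions between frames a -> b and b -> c.
R_ab = cayley(skew(rng.normal(size=dim), n))
R_bc = cayley(skew(rng.normal(size=dim), n))

# Compositionality: applying a -> b, then b -> c, is again a rotation.
R_ac = R_bc @ R_ab

# Invertibility: the inverse action is the transpose.
R_ba = R_ab.T
```

Because every point in the unconstrained parameter space maps to a valid rotation, the representation varies smoothly with its parameters, unlike a discrete codebook index. Composing two latent actions stays inside the group, and the zero parameter vector corresponds to the identity ("no change") action.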

📝 Abstract

Latent Action Models (LAMs) have emerged as an effective paradigm for handling heterogeneous datasets during Vision-Language-Action (VLA) model pretraining, offering a unified action space across embodiments. However, existing LAMs often rely on discrete, quantization-based encode-decode pipelines, which can lead to trivial frame reconstruction behavior, limited representational capacity, and a lack of physically meaningful structure. We introduce RotVLA, a VLA framework built on a continuous rotational latent action representation. Latent actions are modeled as elements of SO(n), providing continuity, compositionality, and structured geometry aligned with real-world action dynamics. A triplet-frame learning framework further enforces meaningful temporal dynamics while avoiding degeneration. RotVLA consists of a VLM backbone and a flow-matching action head, pretrained on large-scale cross-embodiment robotic datasets and human videos with latent-action supervision. For downstream robot control, the flow-matching head is extended into a unified action expert that jointly denoises latent and robot actions. Here, latent actions serve as a latent planner, providing high-level guidance that conditions action generation. With only 1.7B parameters and 1700+ hours of pretraining data, RotVLA achieves 98.2% on LIBERO and 89.6% / 88.5% on RoboTwin2.0 under clean and randomized settings, respectively. It also demonstrates strong real-world performance on manipulation tasks, consistently outperforming existing VLA models.
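For readers unfamiliar with flow matching, the sketch below shows the generic recipe in NumPy: build a training pair along a straight path from noise to a target action, then generate by integrating a velocity field with Euler steps. This is an assumption-laden illustration rather than RotVLA's action head. The straight-line path, the function names `fm_training_pair` and `euler_sample`, and the oracle velocity used in place of a trained network are all hypothetical choices, and the abstract does not specify the conditioning or the joint latent/robot-action layout.

```python
import numpy as np

def fm_training_pair(x0, x1, t):
    """Return the interpolated point x_t and its regression target.

    x0: noise sample, x1: target action, t: scalar in [0, 1].
    Assumes a straight-line path x_t = (1 - t) * x0 + t * x1,
    whose velocity is the constant x1 - x0.
    """
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=7)             # noise; 7 is an arbitrary action dimension
x1 = np.linspace(-1.0, 1.0, 7)      # hypothetical target action

# Training side: a network v_theta(x_t, t, context) would regress v_target.
x_t, v_target = fm_training_pair(x0, x1, t=0.3)

# Sampling side: an oracle velocity stands in for the trained network.
x_gen = euler_sample(lambda x, t: x1 - x0, x0, steps=10)
```

In a real model the oracle would be replaced by a learned network conditioned on the VLM features. For the downstream setup the abstract describes, the denoised vector would presumably cover both latent and robot actions, though the exact layout is not given here.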

Problem

Research questions and friction points this paper is trying to address.

Latent Action Models

Vision-Language-Action

discrete quantization

representational capacity

physically meaningful structure

Innovation

Methods, ideas, or system contributions that make the work stand out.

rotational latent action

SO(n) representation

flow-matching