🤖 AI Summary
This work addresses the scarcity of action-labeled robot data, which limits the performance of vision-language-action (VLA) models. The authors propose ALAM, which uses temporal relations in action-free videos to learn an algebraically consistent latent transition space. The latent action encoder is pretrained on frame triplets with a reconstruction objective and regularized by composition and reversal consistency, yielding structured, locally additive transition representations. For policy learning, the pretrained encoder is frozen and its latent transition sequences serve as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective, so no latent-to-action decoding is required. On the MetaWorld MT50 and LIBERO benchmarks, the approach raises average success rates from 47.9% to 85.0% and from 94.1% to 98.1%, respectively, and shows consistent gains on real-world manipulation tasks.
📝 Abstract
Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstruction-trained latent codes are not necessarily suitable for policy generation: they may predict future observations while lacking the structure needed to be reused or generated coherently with robot actions. We introduce ALAM, an Algebraically Consistent Latent Action Model that turns temporal relations in action-free video into structural supervision. Given frame triplets, ALAM learns latent transitions that are grounded by reconstruction while being regularized by composition and reversal consistency, encouraging a locally additive transition space. For downstream VLA learning, we freeze the pretrained encoder and use its latent transition sequences as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective. This couples structured latent transitions with flow-based policy generation, allowing the policy to exploit ALAM's locally consistent transition geometry without requiring latent-to-action decoding. Representation probes show that ALAM reduces additivity and reversibility errors by a factor of 25 to 85 relative to unstructured latent-action baselines and improves long-horizon cumulative reconstruction. When transferred to VLA policies, ALAM raises the average success rate from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks. Ablations further confirm that the strongest improvements arise from the synergy between algebraically structured latent transitions and joint flow matching.
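
The abstract does not state the regularizers as formulas. One plausible reading of "composition and reversal consistency" in a locally additive transition space is that, for a frame triplet (a, b, c), the latent transition a→b plus the transition b→c should match the direct transition a→c, and the transition b→a should cancel a→b. The sketch below illustrates that reading only. The squared-error form, the function name `consistency_losses`, and the linear toy encoder `encode` are assumptions made for illustration, not ALAM's actual losses or architecture.

```python
import numpy as np


def consistency_losses(z_ab, z_bc, z_ac, z_ba):
    """Mean squared composition and reversal residuals.

    Each argument has shape (batch, dim) and holds the latent transition
    between two frames of a triplet (a, b, c).
    """
    # Composition: a->b followed by b->c should equal the direct a->c step.
    composition = np.mean(np.sum((z_ab + z_bc - z_ac) ** 2, axis=-1))
    # Reversal: the b->a transition should be the negative of a->b.
    reversal = np.mean(np.sum((z_ab + z_ba) ** 2, axis=-1))
    return composition, reversal


# Toy data: three "frames" per sample, each a feature vector (hypothetical).
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 16, 8))  # each has shape (batch=16, features=8)

# Hypothetical encoder that is linear in the frame difference, so its
# transitions are additive by construction and both residuals vanish.
W = rng.normal(size=(4, 8))


def encode(x, y):
    return (y - x) @ W.T


comp, rev = consistency_losses(
    encode(a, b), encode(b, c), encode(a, c), encode(b, a)
)

# Unstructured transitions (independent random codes) violate both constraints.
z_random = rng.normal(size=(4, 16, 4))
comp_rand, rev_rand = consistency_losses(*z_random)
```

In this toy setup the additive encoder gives residuals at floating-point precision, while independent random codes give residuals of order one. That difference is the kind of quantity the additivity and reversibility probes mentioned in the abstract would measure.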