🤖 AI Summary
Existing approaches to latent action learning struggle to capture the additive, compositional structure inherent in physical motion; the resulting latent actions often entangle irrelevant scene details or information about future observations, and thus fail to accurately represent state transitions and motion magnitudes. To address this, this work proposes the Additive Compositional Latent Action Model (AC-LAM), which, for the first time, introduces additive algebraic constraints (identity, inverse, and cycle consistency) into the latent action space, enforcing an additive structure over short-horizon observations. Through unsupervised visual state-transition learning, AC-LAM yields semantically clear, motion-specific, and displacement-calibrated latent action representations. Evaluated on both simulated and real-world tabletop manipulation tasks, AC-LAM substantially outperforms existing methods and provides more effective supervisory signals for downstream policy learning.
📝 Abstract
Latent action learning infers pseudo-action labels from visual transitions, providing an approach to leverage internet-scale video for embodied AI. However, most methods learn latent actions without structural priors that encode the additive, compositional structure of physical motion. As a result, latents often entangle irrelevant scene details or information about future observations with true state changes, and their motion magnitudes are miscalibrated. We introduce the Additively Compositional Latent Action Model (AC-LAM), which enforces a scene-wise additive composition structure over short horizons on the latent action space. These AC constraints encourage a simple algebraic structure (identity, inverse, cycle consistency) in the latent action space and suppress information that does not compose additively. Empirically, AC-LAM learns more structured, motion-specific, and displacement-calibrated latent actions and provides stronger supervision for downstream policy learning, outperforming state-of-the-art LAMs across simulated and real-world tabletop tasks.
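The additive constraints named above (identity, inverse, cycle consistency) can be written as simple penalties on latent action vectors. The sketch below is an illustrative reconstruction, not the authors' implementation: the function names and the squared-error form of each penalty are assumptions, and `z_ab` denotes the latent action inferred for the transition from observation `a` to observation `b`.

```python
import numpy as np

def identity_loss(z_aa: np.ndarray) -> float:
    # Identity: a self-transition (no visual change) should map to the
    # zero action, so we penalize any nonzero latent.
    return float(np.sum(z_aa ** 2))

def inverse_loss(z_ab: np.ndarray, z_ba: np.ndarray) -> float:
    # Inverse: reversing a transition should negate its latent action,
    # i.e. z_ab + z_ba should vanish.
    return float(np.sum((z_ab + z_ba) ** 2))

def cycle_loss(z_ab: np.ndarray, z_bc: np.ndarray,
               z_ac: np.ndarray) -> float:
    # Cycle consistency: composing a->b and b->c additively should
    # reproduce the directly inferred latent for a->c.
    return float(np.sum((z_ab + z_bc - z_ac) ** 2))
```

Under this formulation, all three penalties are zero exactly when the latent action space behaves like a vector space of displacements, which is what suppresses information (e.g. static scene content) that does not compose additively.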