🤖 AI Summary
This work addresses the vulnerability of existing latent action models to task-irrelevant distractors, which often leads to the erroneous encoding of noise as action signals. To mitigate this, the authors propose a novel approach that leverages the commonsense reasoning capabilities of vision-language models (VLMs) to generate task-aware representations distinguishing controllable dynamics from noise. Specifically, task-oriented natural language prompts—such as “ignore distractors”—are used to guide VLMs in producing supervision signals that, in an unsupervised setting, steer latent action models toward learning task-centric action representations. Evaluated on the Distracting MetaWorld benchmark, the method improves downstream task success rates by up to sixfold, improves the semantic consistency of learned actions, and suppresses distractor interference. The study also reveals notable differences in prompt sensitivity and performance across various VLMs in the context of action representation learning.
📝 Abstract
Latent Action Models (LAMs) have rapidly gained traction as an important component in the pre-training pipelines of leading Vision-Language-Action models. However, they fail when observations contain action-correlated distractors, often encoding noise instead of meaningful latent actions. Humans, on the other hand, can effortlessly distinguish task-relevant motions from irrelevant details in any video given only a brief task description. In this work, we propose to utilize the common-sense reasoning abilities of Vision-Language Models (VLMs) to provide promptable representations, effectively separating controllable changes from noise in an unsupervised way. We use these representations as targets during LAM training and benchmark a wide variety of popular VLMs, revealing substantial variation in the quality of promptable representations as well as in their robustness to different prompts and hyperparameters. Interestingly, we find that more recent VLMs may perform worse than older ones. Finally, we show that simply asking VLMs to ignore distractors can substantially improve latent action quality, yielding up to a six-fold increase in downstream success rates on Distracting MetaWorld.
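The core idea, using prompt-conditioned VLM representations as regression targets for a latent action model instead of raw pixels, can be sketched in toy form. Everything below is hypothetical scaffolding: `vlm_embed` is a random-projection stand-in for a real promptable VLM, and the linear encoder/decoder only illustrates the objective, not the paper's implementation.

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_VLM, D_ACT = 32, 16, 4  # toy dimensions (assumed, not from the paper)

def vlm_embed(obs, prompt):
    """Hypothetical stand-in for a promptable VLM encoder: a fixed random
    projection seeded by the prompt, so different prompts induce different
    representation spaces (a real VLM would be queried with the prompt)."""
    g = np.random.default_rng(zlib.crc32(prompt.encode()))
    return obs @ g.normal(size=(D_OBS, D_VLM))

# Toy linear latent action model: the encoder infers a latent action from a
# pair of consecutive observations; the decoder predicts the next frame's
# VLM representation from the current frame's representation plus the action.
W_enc = rng.normal(scale=0.1, size=(2 * D_OBS, D_ACT))
W_dec = rng.normal(scale=0.1, size=(D_VLM + D_ACT, D_VLM))

def lam_loss(o_t, o_t1, prompt):
    z = np.concatenate([o_t, o_t1]) @ W_enc        # inferred latent action
    e_t = vlm_embed(o_t, prompt)                   # current-frame target space
    e_t1 = vlm_embed(o_t1, prompt)                 # next-frame target
    pred = np.concatenate([e_t, z]) @ W_dec        # predicted next representation
    return float(np.mean((pred - e_t1) ** 2))      # regress in VLM space, not pixels

o_t, o_t1 = rng.normal(size=D_OBS), rng.normal(size=D_OBS)
print(f"{lam_loss(o_t, o_t1, 'ignore distractors'):.4f}")
```

Because the targets live in the VLM's prompt-conditioned space, distractor pixels the prompt tells the VLM to ignore contribute nothing to the loss, which is the mechanism the abstract credits for the improved latent actions.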