🤖 AI Summary
This paper addresses the fundamental question of whether latent action models (LAMs) genuinely learn action-driven inter-frame dynamics or merely capture exogenous noise. Method: We develop an analytically tractable linear model to characterize the learning mechanism of LAMs, uncovering their relationship with principal component analysis (PCA) and analyzing how the structural coupling among observations, actions, and noise governs model performance. Leveraging controllability theory, we derive principled guidelines for designing data generation strategies. Contribution/Results: These guidelines inform data augmentation, denoising, and auxiliary action prediction. Numerical simulations demonstrate that our strategy significantly enhances learning of action-relevant features, thereby advancing the interpretability and reliability of unsupervised action representation learning.
📝 Abstract
Latent action models (LAMs) aim to learn action-relevant changes from unlabeled videos by compressing changes between frames as latents. However, differences between video frames can be caused by controllable changes as well as exogenous noise, leading to an important concern -- do latents capture the changes caused by actions or irrelevant noise? This paper studies this issue analytically, presenting a linear model that encapsulates the essence of LAM learning while remaining tractable. This provides several insights, including connections between LAMs and principal component analysis (PCA), desiderata for the data-generating policy, and justification of strategies that encourage learning controllable changes using data augmentation, data cleaning, and auxiliary action prediction. We also provide illustrative results based on numerical simulation, shedding light on how the specific structure of observations, actions, and noise in the data influences LAM learning.
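The concern raised in the abstract can be illustrated with a small numerical sketch. This is not the paper's model; it is a toy setup under assumptions made here for illustration: frame differences are the sum of an action-driven change along one fixed direction and exogenous noise along another, and the "latent" is taken to be the top principal component of those differences. The function name `top_pc_alignment`, the chosen directions, and the scale parameters are all hypothetical.

```python
import numpy as np

def top_pc_alignment(action_scale, noise_scale, n=5000, dim=8, seed=0):
    """Return |cosine| of the top principal component of frame differences
    with the (assumed) action direction and the (assumed) noise direction."""
    rng = np.random.default_rng(seed)

    # Assumption: actions move observations along axis 0,
    # exogenous noise moves them along axis 1.
    action_dir = np.zeros(dim)
    action_dir[0] = 1.0
    noise_dir = np.zeros(dim)
    noise_dir[1] = 1.0

    actions = rng.normal(0.0, action_scale, size=n)
    noise = rng.normal(0.0, noise_scale, size=n)

    # Frame differences: action-driven part + exogenous part + small isotropic jitter.
    deltas = np.outer(actions, action_dir) + np.outer(noise, noise_dir)
    deltas += 0.01 * rng.normal(size=(n, dim))
    deltas -= deltas.mean(axis=0)

    # The first right singular vector is the top principal component.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    top = vt[0]
    return abs(top @ action_dir), abs(top @ noise_dir)

# Action-driven changes dominate the variance of frame differences.
strong_action = top_pc_alignment(action_scale=3.0, noise_scale=1.0)
# Exogenous noise dominates the variance of frame differences.
strong_noise = top_pc_alignment(action_scale=1.0, noise_scale=3.0)

print("action-dominated data (action, noise alignment):", strong_action)
print("noise-dominated data  (action, noise alignment):", strong_noise)
```

In this toy setting, a one-dimensional PCA-style latent follows whichever source contributes more variance to the frame differences, which is one way to see why the data-generating policy and noise structure matter.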