LyTimeT: Towards Robust and Interpretable State-Variable Discovery

📅 2025-10-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of extracting dynamical state variables from high-dimensional videos corrupted by background motion, occlusions, and texture variations, this paper proposes a two-stage latent-space learning framework. First, a spatiotemporal autoencoder based on TimeSformer is constructed to extract robust representations via global attention. Second, Lyapunov stability regularization is introduced to jointly enforce dynamic contractivity, disturbance robustness, and physical interpretability—achieving, for the first time, their unified balance in latent space. Physical variables are disentangled via linear correlation analysis, and rollout error accumulation is suppressed. Evaluated on five synthetic and four real-world dynamical systems, the method significantly outperforms CNN- and pure-Transformer-based baselines in mutual information, intrinsic dimension estimation, and long-horizon prediction accuracy, while exhibiting invariance to background disturbances.

Technology Category

Application Category

📝 Abstract
Extracting the true dynamical variables of a system from high-dimensional video is challenging due to distracting visual factors such as background motion, occlusions, and texture changes. We propose LyTimeT, a two-phase framework for interpretable variable extraction that learns robust and stable latent representations of dynamical systems. In Phase 1, LyTimeT employs a spatio-temporal TimeSformer-based autoencoder that uses global attention to focus on dynamically relevant regions while suppressing nuisance variation, enabling distraction-robust latent state learning and accurate long-horizon video prediction. In Phase 2, we probe the learned latent space, select the most physically meaningful dimensions using linear correlation analysis, and refine the transition dynamics with a Lyapunov-based stability regularizer to enforce contraction and reduce error accumulation during roll-outs. Experiments on five synthetic benchmarks and four real-world dynamical systems, including chaotic phenomena, show that LyTimeT achieves mutual information and intrinsic dimension estimates closest to ground truth, remains invariant under background perturbations, and delivers the lowest analytical mean squared error among CNN-based (TIDE) and transformer-only baselines. Our results demonstrate that combining spatio-temporal attention with stability constraints yields predictive models that are not only accurate but also physically interpretable.
Problem

Research questions and friction points this paper is trying to address.

Extracting true dynamical variables from high-dimensional video data
Learning robust latent representations by suppressing distracting visual factors
Enforcing stability constraints to improve interpretability and prediction accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-phase framework for robust latent state learning
Spatio-temporal attention to suppress distracting visual factors
Lyapunov-based stability regularizer for interpretable dynamics
🔎 Similar Papers
No similar papers found.
K
Kuai Yu
Department of Computer Science, Columbia University, New York, NY , USA
C
Crystal Su
Department of Computer Science, Columbia University, New York, NY , USA
X
Xiang Liu
School of Computing, National University of Singapore, Singapore
J
Judah Goldfeder
Department of Computer Science, Columbia University, New York, NY , USA
M
Mingyuan Shao
Department of Computer Science, Columbia University, New York, NY , USA
Hod Lipson
Hod Lipson
Professor of Mechanical Engineering, Columbia University
RoboticsArtificial IntelligenceAdditive ManufacturingData ScienceMechanical Engineering