LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing robotic foundation models struggle to leverage transferable dynamics knowledge from heterogeneous embodied data, hindered by coarse-grained data utilization and dataset fragmentation. The authors propose a unified embodied data ingestion framework that jointly learns dynamics, policy, and visual prediction, assigning differentiated roles to data of varying quality and thereby enabling, for the first time, effective use of low-quality trajectories. They also introduce EI-30k, a standardized large-scale embodied interaction dataset. Their method performs dynamics modeling in a structured DINO latent space and uses a multimodal diffusion Transformer to process asynchronous visual and action streams, avoiding pixel-level redundancy. In both simulation and real-world settings, the approach outperforms prior methods by up to 21%, 48%, and 23% on contact-rich, dexterous-manipulation, and long-horizon tasks, respectively, and achieves a 10% performance gain by additionally leveraging 30% low-quality data that would typically be discarded.
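The "differentiated roles" idea in the summary can be pictured as routing each trajectory's quality tier to a different mix of training objectives. The sketch below is a minimal illustration only, not the paper's implementation: the tier names, weight values, and function names are assumptions.

```python
# Hypothetical sketch of quality-differentiated loss routing: high-quality
# data supervises the policy, while lower-quality data still contributes to
# dynamics and visual prediction instead of being discarded outright.

def loss_weights(quality: str) -> dict:
    """Per-objective loss weights for one trajectory (illustrative values)."""
    if quality == "expert":          # clean teleoperated demonstrations
        return {"policy": 1.0, "dynamics": 1.0, "visual": 1.0}
    if quality == "suboptimal":      # noisy or failed trajectories
        return {"policy": 0.0, "dynamics": 1.0, "visual": 1.0}
    if quality == "action_free":     # e.g. human videos with no action labels
        return {"policy": 0.0, "dynamics": 0.0, "visual": 1.0}
    raise ValueError(f"unknown quality tier: {quality}")

def total_loss(losses: dict, quality: str) -> float:
    """Weighted sum of the per-objective losses for this trajectory."""
    w = loss_weights(quality)
    return sum(w[k] * losses[k] for k in losses)
```

Under this routing, a suboptimal trajectory with losses `{"policy": 0.4, "dynamics": 0.2, "visual": 0.1}` contributes only its dynamics and visual terms (0.3 total), so imperfect actions never supervise the policy directly.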

📝 Abstract
Recent robot foundation models largely rely on large-scale behavior cloning, which imitates expert actions but discards transferable dynamics knowledge embedded in heterogeneous embodied data. While the Unified World Model (UWM) formulation has the potential to leverage such diverse data, existing instantiations struggle to scale to the foundation level due to coarse data usage and fragmented datasets. We introduce LDA-1B, a robot foundation model that scales through universal embodied data ingestion by jointly learning dynamics, policy, and visual forecasting, assigning distinct roles to data of varying quality. To support this regime at scale, we assemble and standardize EI-30k, an embodied interaction dataset comprising over 30k hours of human and robot trajectories in a unified format. Scalable dynamics learning over such heterogeneous data is enabled by prediction in a structured DINO latent space, which avoids redundant pixel-space appearance modeling. Complementing this representation, LDA-1B employs a multi-modal diffusion transformer to handle asynchronous vision and action streams, enabling stable training at the 1B-parameter scale. Experiments in simulation and the real world show LDA-1B outperforms prior methods (e.g., $\pi_{0.5}$) by up to 21\%, 48\%, and 23\% on contact-rich, dexterous, and long-horizon tasks, respectively. Notably, LDA-1B enables data-efficient fine-tuning, gaining 10\% by leveraging the 30\% of low-quality trajectories that are typically harmful and therefore discarded.
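The abstract's core trick — predicting dynamics in a frozen visual latent space rather than in pixels — can be sketched in a few lines. This is a toy illustration under stated assumptions: the paper uses DINO features and a 1B-parameter diffusion transformer, whereas here a fixed random projection stands in for the frozen encoder and a linear map stands in for the dynamics model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen visual encoder (the paper uses DINO features):
# a fixed random projection from a flattened frame to a latent vector.
D_PIX, D_LAT, D_ACT = 64, 8, 2
W_enc = rng.normal(size=(D_PIX, D_LAT)) / np.sqrt(D_PIX)

def encode(frame: np.ndarray) -> np.ndarray:
    """Map a frame to its latent; frozen, never trained."""
    return frame @ W_enc

# Trainable one-step latent dynamics model: z_{t+1} ≈ f(z_t, a_t).
# A linear map here; the paper's model is a diffusion transformer.
W_dyn = np.zeros((D_LAT + D_ACT, D_LAT))

def predict(z: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Predict the next latent from the current latent and action."""
    return np.concatenate([z, a]) @ W_dyn

def latent_mse(frame_t, action_t, frame_next) -> float:
    """The loss lives in latent space, so pixel appearance is never
    reconstructed — only the dynamics-relevant representation."""
    z_t, z_next = encode(frame_t), encode(frame_next)
    err = predict(z_t, action_t) - z_next
    return float(err @ err / D_LAT)
```

The design point the abstract makes is visible even in this toy: the training signal compares D_LAT-dimensional latents, not D_PIX-dimensional pixels, so the model never spends capacity on appearance details irrelevant to dynamics.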
Problem

Research questions and friction points this paper is trying to address.

robot foundation model
embodied data
dynamics learning
heterogeneous data
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

universal embodied data ingestion
structured latent dynamics
multi-modal diffusion transformer
robot foundation model
heterogeneous data scaling