OmniSAT: Compact Action Token, Faster Auto Regression

📅 2025-10-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Autoregressive models face two challenges when modeling high-dimensional action sequences: excessive sequence length and poor reconstruction fidelity. To address this, the authors propose the Omni Swift Action Tokenizer (OmniSAT), described as the first action discretization framework to integrate B-spline continuous encoding with multi-stage residual quantization. It enables joint representation learning across diverse robot and human morphologies within a unified action-pattern space, and supports scalable auxiliary supervision from heterogeneous egocentric videos. Pretrained on the Droid dataset, the method hierarchically compresses position, rotation, and gripper actions from coarse to fine granularity. Experiments demonstrate a 6.8× reduction in action sequence length, lower target entropy, faster autoregressive training convergence, and state-of-the-art reconstruction fidelity and downstream task performance.

📝 Abstract

Existing Vision-Language-Action (VLA) models can be broadly categorized into diffusion-based and auto-regressive (AR) approaches. Diffusion models capture continuous action distributions but rely on computationally heavy iterative denoising. In contrast, AR models enable efficient optimization and flexible sequence construction, making them better suited for large-scale pretraining. To further improve AR efficiency, particularly when action chunks induce extended and high-dimensional sequences, prior work applies entropy-guided and token-frequency techniques to shorten the sequence length. However, such compression struggles with \textit{poor reconstruction or inefficient compression}. Motivated by this, we introduce the Omni Swift Action Tokenizer (OmniSAT), which learns a compact, transferable action representation. Specifically, we first normalize value ranges and temporal horizons to obtain a consistent representation with B-Spline encoding. Then, we apply multi-stage residual quantization to the position, rotation, and gripper subspaces, producing compressed discrete tokens with coarse-to-fine granularity for each part. After pre-training on the large-scale dataset Droid, the resulting discrete tokenization shortens the training sequence by 6.8$\times$ and lowers the target entropy. To further explore the potential of OmniSAT, we develop a cross-embodiment learning strategy that builds on the unified action-pattern space and jointly leverages robot and human demonstrations. It enables scalable auxiliary supervision from heterogeneous egocentric videos. Across diverse real-robot and simulation experiments, OmniSAT achieves higher compression while preserving reconstruction quality, enabling faster AR training convergence and better model performance.
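
As a rough illustration of the B-Spline encoding step, the sketch below fits a fixed number of control points to action chunks of different horizons, so that every chunk ends up with the same shape. It is not the paper's implementation: the function names (`bspline_basis`, `encode_chunk`, `decode_chunk`), the clamped uniform knot vector, the least-squares fit, and the choice of 8 cubic control points are all assumptions made for illustration, and value-range normalization is omitted.

```python
import numpy as np


def bspline_basis(num_ctrl, degree, ts):
    """Clamped uniform B-spline basis evaluated at ts in [0, 1].

    Returns an array of shape (len(ts), num_ctrl). Requires num_ctrl > degree.
    """
    ts = np.asarray(ts, dtype=float)
    # Clamped knots: degree + 1 repeats at each end, uniform knots in between.
    inner = np.linspace(0.0, 1.0, num_ctrl - degree + 1)[1:-1]
    knots = np.concatenate([np.zeros(degree + 1), inner, np.ones(degree + 1)])

    # Degree-0 basis: indicator function of each knot span.
    basis = np.zeros((len(ts), len(knots) - 1))
    for i in range(len(knots) - 1):
        basis[:, i] = (knots[i] <= ts) & (ts < knots[i + 1])
    # Assign t == 1 to the last non-empty span so the endpoint is covered.
    basis[ts >= 1.0, num_ctrl - 1] = 1.0

    # Cox-de Boor recursion, raising the degree one step at a time.
    for d in range(1, degree + 1):
        nxt = np.zeros((len(ts), len(knots) - 1 - d))
        for i in range(len(knots) - 1 - d):
            left = knots[i + d] - knots[i]
            right = knots[i + d + 1] - knots[i + 1]
            if left > 0:
                nxt[:, i] += (ts - knots[i]) / left * basis[:, i]
            if right > 0:
                nxt[:, i] += (knots[i + d + 1] - ts) / right * basis[:, i + 1]
        basis = nxt
    return basis


def encode_chunk(actions, num_ctrl=8, degree=3):
    """Least-squares fit of control points to a (horizon, action_dim) chunk."""
    actions = np.asarray(actions, dtype=float)
    ts = np.linspace(0.0, 1.0, len(actions))  # normalize the temporal horizon
    ctrl, *_ = np.linalg.lstsq(bspline_basis(num_ctrl, degree, ts), actions, rcond=None)
    return ctrl  # shape (num_ctrl, action_dim), independent of the horizon


def decode_chunk(ctrl, horizon, degree=3):
    """Resample the spline defined by ctrl at `horizon` evenly spaced steps."""
    ts = np.linspace(0.0, 1.0, horizon)
    return bspline_basis(len(ctrl), degree, ts) @ ctrl


def demo_curve(horizon):
    """Toy 2-D trajectory (hypothetical data) sampled at `horizon` steps."""
    t = np.linspace(0.0, 1.0, horizon)
    return np.stack([t**3 - t, 2.0 * t], axis=1)


ctrl_short = encode_chunk(demo_curve(30))
ctrl_long = encode_chunk(demo_curve(50))
print(ctrl_short.shape, ctrl_long.shape)  # (8, 2) (8, 2)
```

Because the toy trajectory is a cubic polynomial, a cubic spline reproduces it exactly and both horizons yield the same control points. Real action chunks would only be approximated, with the number of control points trading compression against fidelity.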

Problem

Research questions and friction points this paper is trying to address.

Compress high-dimensional action sequences for efficient training

Improve reconstruction quality of compressed action representations

Enable cross-embodiment learning from heterogeneous demonstration sources

Innovation

Methods, ideas, or system contributions that make the work stand out.

Compact action tokens via multi-stage residual quantization

B-Spline encoding for consistent temporal representation

Cross-embodiment learning from heterogeneous demonstration sources
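
The multi-stage residual quantization named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's method: the codebooks here are fixed ternary grids built by a hypothetical helper (`make_grid_codebooks`), whereas a real tokenizer would learn them, and the sketch quantizes a single vector rather than separate position, rotation, and gripper subspaces.

```python
import numpy as np


def make_grid_codebooks(dim, num_stages):
    """Hypothetical fixed codebooks: a {-1, 0, 1}^dim grid, shrunk 3x per stage."""
    axes = [np.array([-1.0, 0.0, 1.0])] * dim
    base = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, dim)
    return [base / 3**k for k in range(num_stages)]


def residual_quantize(x, codebooks):
    """Quantize x coarse to fine: each stage encodes what the previous ones missed.

    Returns one token index per stage and the final reconstruction.
    """
    residual = np.asarray(x, dtype=float).copy()
    recon = np.zeros_like(residual)
    tokens = []
    for codebook in codebooks:
        # Pick the code nearest to the current residual.
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        tokens.append(idx)
        recon += codebook[idx]
        residual -= codebook[idx]
    return tokens, recon


def dequantize(tokens, codebooks):
    """Rebuild the vector by summing the selected code from each stage."""
    return sum(codebook[i] for codebook, i in zip(codebooks, tokens))


codebooks = make_grid_codebooks(dim=2, num_stages=3)
x = np.array([0.7, -0.2])
tokens, recon = residual_quantize(x, codebooks)

# Reconstruction error after using only the first k stages (coarse to fine).
errors = [
    float(np.linalg.norm(x - dequantize(tokens[:k], codebooks[:k])))
    for k in range(1, len(codebooks) + 1)
]
print(len(tokens), errors)
```

Each stage adds one token, so the number of stages sets the token budget per vector. Truncating to the first stages still gives a usable coarse reconstruction, which is the coarse-to-fine property the summary refers to.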
