A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics

📅 2026-04-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a limitation of in-context imitation learning: action representations that lack effective spatiotemporal structure. The authors propose the Hierarchical Spatiotemporal Action Tokenizer (HiST-AT), which jointly leverages spatial and temporal cues during action tokenization. HiST-AT applies two successive levels of vector quantization: the lower level assigns input actions to fine-grained subclusters, and the higher level maps those subclusters to coarser clusters, while the model reconstructs both the input actions and their associated timestamps. According to the authors, HiST-AT achieves state-of-the-art performance in in-context imitation learning across multiple simulated and real-world robotic manipulation benchmarks.

📝 Abstract
We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach, which consists of two successive levels of vector quantization. In particular, the lower level assigns input actions to fine-grained subclusters, while the higher level further maps fine-grained subclusters to clusters. Our hierarchical approach outperforms the non-hierarchical counterpart, while mainly exploiting spatial information by reconstructing input actions. Furthermore, we extend our approach by utilizing both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach conducts multi-level clustering, while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulated and real-world robotic manipulation benchmarks show that our approach establishes a new state of the art in in-context imitation learning.
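
The two-level quantization described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: it assumes an identity encoder and decoder, fixed random codebooks with no training, and a simple sum of two mean-squared errors as the objective. The names `nearest_code`, `hierarchical_quantize`, `spatiotemporal_loss`, and `time_weight` are hypothetical and chosen only for illustration.

```python
import numpy as np

def nearest_code(x, codebook):
    """Return, for each row of x, the index of the closest codebook row."""
    # Squared Euclidean distance between every input and every code: shape (N, K).
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def hierarchical_quantize(x, fine_codebook, coarse_codebook):
    """Two successive quantization levels: inputs -> subclusters -> clusters."""
    # Lower level: assign each input to a fine-grained subcluster.
    fine_idx = nearest_code(x, fine_codebook)
    fine_vec = fine_codebook[fine_idx]
    # Higher level: map each subcluster vector to a coarser cluster.
    coarse_idx = nearest_code(fine_vec, coarse_codebook)
    return fine_idx, coarse_idx, fine_vec

def spatiotemporal_loss(x, x_hat, action_dim, time_weight=1.0):
    """Assumed objective: action reconstruction error plus weighted timestamp error."""
    action_err = np.mean((x[:, :action_dim] - x_hat[:, :action_dim]) ** 2)
    time_err = np.mean((x[:, action_dim:] - x_hat[:, action_dim:]) ** 2)
    return action_err + time_weight * time_err

# Toy data: 8 two-dimensional actions, each paired with a normalized timestamp.
rng = np.random.default_rng(0)
actions = rng.normal(size=(8, 2))
timestamps = np.linspace(0.0, 1.0, 8)[:, None]
x = np.concatenate([actions, timestamps], axis=1)

# Assumption: codebooks live in the same 3-D space as [action, timestamp].
fine_codebook = rng.normal(size=(6, 3))    # 6 fine-grained subclusters
coarse_codebook = rng.normal(size=(2, 3))  # 2 coarse clusters

fine_idx, coarse_idx, x_hat = hierarchical_quantize(x, fine_codebook, coarse_codebook)
loss = spatiotemporal_loss(x, x_hat, action_dim=2)
print(fine_idx, coarse_idx, loss)
```

Because the coarse index is computed from the fine code vector rather than from the raw input, every input that shares a subcluster also shares a cluster, which is the many-to-one structure the abstract describes. A trained tokenizer would additionally learn the codebooks along with an encoder and decoder, which this sketch omits.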
Problem

Research questions and friction points this paper is trying to address.

in-context imitation learning
action tokenization
spatiotemporal modeling
robotic manipulation
hierarchical representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical vector quantization
spatiotemporal action tokenizer
in-context imitation learning
multi-level clustering
robotic manipulation