A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of in-context imitation learning: action representations that do not model spatiotemporal structure effectively. The authors propose the Hierarchical Spatiotemporal Action Tokenizer (HiST-AT), which uses both spatial and temporal cues during action tokenization. HiST-AT applies two successive levels of vector quantization: the lower level assigns input actions to fine-grained subclusters, and the higher level maps those subclusters to clusters. At the same time, it reconstructs both the input actions and their associated timestamps. The authors report state-of-the-art in-context imitation learning performance on multiple simulated and real-world robotic manipulation benchmarks.

📝 Abstract

We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach, which consists of two successive levels of vector quantization. In particular, the lower level assigns input actions to fine-grained subclusters, while the higher level further maps fine-grained subclusters to clusters. Our hierarchical approach outperforms the non-hierarchical counterpart, while mainly exploiting spatial information by reconstructing input actions. Furthermore, we extend our approach by utilizing both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach conducts multi-level clustering, while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulated and real-world robotic manipulation benchmarks show that our approach establishes a new state of the art in in-context imitation learning.
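The two successive levels of vector quantization described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`nearest_code`, `hierarchical_quantize`), the fixed toy codebooks, and the use of plain Euclidean nearest-neighbour assignment are all assumptions made for illustration. Codebook learning and the reconstruction of timestamps are not shown.

```python
import numpy as np


def nearest_code(x, codebook):
    """Index of the nearest codebook row (Euclidean) for each row of x."""
    # (N, 1, D) - (1, K, D) -> (N, K) squared distances
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def hierarchical_quantize(actions, fine_codebook, coarse_codebook):
    """Two-level quantization (illustrative assumption, not the paper's code).

    Lower level: each action -> nearest fine-grained subcluster.
    Higher level: each subcluster -> nearest coarse cluster.
    """
    fine_idx = nearest_code(actions, fine_codebook)
    # Map every fine code to a coarse code once, then look it up per action.
    fine_to_coarse = nearest_code(fine_codebook, coarse_codebook)
    coarse_idx = fine_to_coarse[fine_idx]
    # Reconstruct each action from its fine-level code.
    recon = fine_codebook[fine_idx]
    return fine_idx, coarse_idx, recon


# Toy 2-D "actions" and hand-picked codebooks (hypothetical values).
fine_codebook = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
coarse_codebook = np.array([[0.0, 0.5], [10.0, 10.5]])
actions = np.array([[0.1, -0.1], [0.2, 0.9], [9.8, 10.1], [10.1, 11.2]])

fine_idx, coarse_idx, recon = hierarchical_quantize(
    actions, fine_codebook, coarse_codebook
)
print(fine_idx.tolist())    # [0, 1, 2, 3]
print(coarse_idx.tolist())  # [0, 0, 1, 1]
```

In this toy setup each action receives both a fine token and a coarse token, and the reconstruction uses only the fine-level code. How HiST-AT actually combines the two levels into tokens is not specified in the abstract.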

Problem

Research questions and friction points this paper is trying to address.

in-context imitation learning

action tokenization

spatiotemporal modeling

robotic manipulation

hierarchical representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical vector quantization

spatiotemporal action tokenizer

in-context imitation learning