HiFlow: Tokenization-Free Scale-Wise Autoregressive Policy Learning via Flow Matching

📅 2026-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing visuomotor policies that rely on discrete action tokenizers, which introduce quantization error and require multi-stage training pipelines. The authors propose HiFlow, a multiscale autoregressive policy that operates directly in the continuous action space without tokenization. HiFlow constructs coarse-to-fine action targets through temporal pooling and leverages flow matching for end-to-end, single-stage training. This coarse-to-fine autoregressive design models continuous actions without discretization, simplifying the learning pipeline while improving accuracy. Empirical evaluations demonstrate that HiFlow outperforms both diffusion-based models and tokenizer-based autoregressive methods across diverse benchmarks, including MimicGen, RoboTwin 2.0, and real-world robotic environments.
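The temporal pooling described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the choice of scale levels and the assumption that the chunk length divides evenly are hypothetical.

```python
import numpy as np

def multiscale_targets(chunk, scales=(1, 2, 4, 8)):
    """Build coarse-to-fine continuous action targets from one action chunk.

    chunk:  (T, D) array of T continuous actions of dimension D.
    scales: number of pooled segments per level, coarse to fine
            (a hypothetical choice; the paper's exact levels may differ).

    Each level averages contiguous windows of the chunk, so level k
    contains scales[k] pooled actions; the finest level can recover
    the raw chunk itself.
    """
    T, D = chunk.shape
    targets = []
    for s in scales:
        assert T % s == 0, "this sketch assumes T divides evenly"
        w = T // s                                     # window length at this scale
        pooled = chunk.reshape(s, w, D).mean(axis=1)   # (s, D) coarse summary
        targets.append(pooled)
    return targets

# Example: an 8-step chunk of 2-DoF actions
chunk = np.arange(16, dtype=float).reshape(8, 2)
levels = multiscale_targets(chunk)   # shapes (1,2), (2,2), (4,2), (8,2)
```

The coarsest level is a single averaged action summarizing the whole chunk; each finer level halves the window size, giving the autoregressive model progressively higher-resolution targets to refine.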
📝 Abstract
Coarse-to-fine autoregressive modeling has recently shown strong promise for visuomotor policy learning, combining the inference efficiency of autoregressive methods with the global trajectory coherence of diffusion-based policies. However, existing approaches rely on discrete action tokenizers that map continuous action sequences to codebook indices, a design inherited from image generation where learned compression is necessary for high-dimensional pixel data. We observe that robot actions are inherently low-dimensional continuous vectors, for which such tokenization introduces unnecessary quantization error and a multi-stage training pipeline. In this work, we propose Hierarchical Flow Policy (HiFlow), a tokenization-free coarse-to-fine autoregressive policy that operates directly on raw continuous actions. HiFlow constructs multi-scale continuous action targets from each action chunk via simple temporal pooling. Specifically, it averages contiguous action windows to produce coarse summaries that are refined at finer temporal resolutions. The entire model is trained end-to-end in a single stage, eliminating the need for a separate tokenizer. Experiments on MimicGen, RoboTwin 2.0, and real-world environments demonstrate that HiFlow consistently outperforms existing methods including diffusion-based and tokenization-based autoregressive policies.
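The flow-matching objective that enables single-stage training on raw continuous actions can be sketched generically. This is a rectified-flow-style loss under standard assumptions (linear interpolation path, Gaussian noise source); the paper's exact conditioning and formulation are not reproduced here, and `predict_velocity` is a hypothetical stand-in for the policy network.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(predict_velocity, actions):
    """One flow-matching training step on a batch of continuous actions.

    actions:          (B, D) batch of target actions x1.
    predict_velocity: hypothetical model v(x_t, t) -> (B, D).

    Samples noise x0, interpolates x_t = (1 - t) x0 + t x1, and regresses
    the predicted velocity onto the path's constant velocity x1 - x0.
    """
    x1 = actions
    x0 = rng.standard_normal(x1.shape)        # noise sample, x0 ~ N(0, I)
    t = rng.uniform(size=(x1.shape[0], 1))    # per-sample time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1              # point on the interpolation path
    target = x1 - x0                          # velocity of the linear path
    pred = predict_velocity(xt, t)
    return np.mean((pred - target) ** 2)      # mean-squared velocity error

# Toy usage: a trivial "model" that always predicts zero velocity
loss = flow_matching_loss(lambda xt, t: np.zeros_like(xt),
                          rng.standard_normal((4, 7)))
```

Because the regression target lives in the raw action space, no codebook or tokenizer stage is needed; the same loss can be applied at every temporal scale of the coarse-to-fine hierarchy.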
Problem

Research questions and friction points this paper is trying to address.

tokenization
autoregressive policy
visuomotor policy learning
quantization error
continuous actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

tokenization-free
flow matching
coarse-to-fine autoregressive
visuomotor policy learning
continuous action modeling