🤖 AI Summary
This work addresses the challenge of jointly achieving multimodal modeling and efficient regression in vision-based motion control. We propose L1 Flow, a flow matching framework reformulated around L1-loss-based sample prediction, parameterized in v-prediction form, and accelerated via single-step ODE integration for rapid sampling. To preserve multimodality while avoiding mode collapse, we introduce a two-stage inference strategy: first generating a suboptimal action sequence in one integration step, then refining it with a single high-fidelity prediction. Evaluated across 14 simulation and real-world tasks from MimicGen, RoboMimic, and PushT, L1 Flow delivers significant gains in both training and inference efficiency (2.3× average speedup) while matching or surpassing state-of-the-art denoising-based baselines in task performance.
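The summary above mentions an L1 sample-prediction objective derived from v-prediction flow matching. A minimal sketch of what such a training loss could look like is below; the function names, batch shapes, and the oracle "model" are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def l1_flow_loss(predict_sample, x1, x0, t):
    """L1 regression on direct sample prediction along a linear flow path.

    Assumed convention: x0 is Gaussian noise at t = 0, x1 is clean data
    at t = 1, and the path is x_t = (1 - t) * x0 + t * x1, whose velocity
    is v = x1 - x0. Predicting the sample x1 and predicting v are related
    by x1_hat = x_t + (1 - t) * v_hat, so this is the v-prediction loss
    re-parameterized in sample space with an L1 metric.
    """
    x_t = (1.0 - t) * x0 + t * x1          # interpolate noise -> data
    x1_hat = predict_sample(x_t, t)        # network predicts the sample
    return np.mean(np.abs(x1_hat - x1))    # L1 objective

# Toy usage with an oracle that returns the true sample: loss is zero.
x1 = rng.normal(size=(4, 8))   # batch of "action sequences" (assumed shape)
x0 = rng.normal(size=(4, 8))   # Gaussian noise
oracle = lambda x_t, t: x1
loss = l1_flow_loss(oracle, x1, x0, 0.3)
```

The L1 metric (rather than the usual L2) is what the paper credits with faster convergence, at the cost of a potential loss of multimodality that the two-stage inference scheme is designed to recover.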
📄 Abstract
Denoising-based models, such as diffusion and flow matching, have become a critical component of robotic manipulation thanks to their strong distribution-fitting and scaling capacity. Concurrently, several works have demonstrated that simple learning objectives, such as L1 regression, can match denoising-based methods on certain tasks while offering faster convergence and inference. In this paper, we focus on combining the advantages of the two paradigms: retaining the ability of denoising models to capture multi-modal distributions and avoid mode collapse, while achieving the efficiency of the L1 regression objective. To this end, we reformulate the original v-prediction flow matching into sample prediction with an L1 training objective. We empirically show that multi-modality can be expressed via a single ODE step. We therefore propose L1 Flow, a two-step sampling schedule that generates a suboptimal action sequence via a single integration step and then reconstructs the precise action sequence through a single prediction. The proposed method largely retains the advantages of flow matching while reducing the number of neural function evaluations to merely two and mitigating the performance degradation associated with direct sample regression. We evaluate our method against multiple baselines and benchmarks, including 8 tasks in MimicGen, 5 tasks in RoboMimic & PushT Bench, and one real-world task. The results show the advantages of the proposed method in training efficiency, inference speed, and overall performance. Project Website: https://song-wx.github.io/l1flow.github.io/
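The two-step sampling schedule described in the abstract (one integration step to a rough action sequence, then one refining prediction) can be sketched as follows. This is a schematic reading of the procedure: the sample-predicting `model(x_t, t)` interface, the intermediate time `t_mid`, and the toy stand-in model are all assumptions for illustration, not the released code:

```python
import numpy as np

def l1_flow_sample(model, x0, t_mid=0.9):
    """Two-step L1 Flow sampling: one ODE step, then one refinement.

    Assumed convention: model(x_t, t) directly predicts the clean sample
    x1 given a point x_t on the linear path x_t = (1 - t) * x0 + t * x1.
    """
    # Step 1: from pure noise (t = 0), a single sample prediction equals
    # one Euler step of the flow ODE, since the linear-path velocity is
    # v = x1 - x0. This yields a rough, possibly suboptimal sequence.
    x1_rough = model(x0, 0.0)
    # Place the rough sample back on the flow path at intermediate t_mid.
    x_mid = (1.0 - t_mid) * x0 + t_mid * x1_rough
    # Step 2: a single high-fidelity prediction from the partially
    # resolved point reconstructs the final action sequence.
    return model(x_mid, t_mid)

# Toy stand-in model: pulls any input 80% of the way toward a fixed
# target action sequence (pretend it was trained with the L1 objective).
target = np.array([0.5, -0.2, 0.1])
toy_model = lambda x_t, t: x_t + 0.8 * (target - x_t)

noise = np.zeros(3)
action = l1_flow_sample(toy_model, noise)
```

With only two function evaluations, this keeps inference cost close to a plain regression model while the first stochastic-noise-conditioned step is what preserves the multi-modality of flow matching.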