L1 Sample Flow for Efficient Visuomotor Learning

๐Ÿ“… 2025-11-21
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the challenge of jointly achieving multimodal modeling and efficient regression in vision-based motion control. We propose L1 Flow: a flow matching framework that reformulates the original v-prediction objective into L1-loss-based sample prediction and accelerates sampling via single-step ODE integration. To preserve multimodality while avoiding mode collapse, we introduce a two-stage inference strategy: first generating a suboptimal action sequence in one integration step, then refining it with a single high-fidelity prediction. Evaluated across 14 simulation and real-world tasks from MimicGen, RoboMimic, and PushT, L1 Flow achieves significant improvements in both training and inference efficiency (2.3× average speedup) while matching or surpassing state-of-the-art denoising-based baselines in overall performance.
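As a rough illustration of the sample-prediction objective described above, a minimal PyTorch-style training loss might look like the sketch below. The network interface (`policy_net(x_t, t, obs)`), the linear noise-to-data interpolation path, and the tensor shapes are assumptions for illustration, not the authors' released implementation.

```python
import torch

def l1_flow_training_loss(policy_net, obs, actions):
    """L1 sample-prediction flow-matching loss (illustrative sketch).

    actions: ground-truth action sequence, shape (batch, horizon, action_dim).
    obs:     visual/state conditioning passed through to the network.
    """
    noise = torch.randn_like(actions)                              # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)  # t ~ U(0, 1)

    # Linear interpolation between noise and data (assumed rectified-flow path).
    x_t = (1.0 - t) * noise + t * actions

    # The network regresses the clean action sequence (sample prediction)
    # rather than the velocity field used in v-prediction flow matching.
    pred_actions = policy_net(x_t, t, obs)

    # L1 regression to the ground-truth sample.
    return (pred_actions - actions).abs().mean()
```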

๐Ÿ“ Abstract
Denoising-based models, such as diffusion and flow matching, have been a critical component of robotic manipulation for their strong distribution-fitting and scaling capacity. Concurrently, several works have demonstrated that simple learning objectives, such as L1 regression, can achieve performance comparable to denoising-based methods on certain tasks, while offering faster convergence and inference. In this paper, we focus on how to combine the advantages of these two paradigms: retaining the ability of denoising models to capture multi-modal distributions and avoid mode collapse while achieving the efficiency of the L1 regression objective. To achieve this vision, we reformulate the original v-prediction flow matching and transform it into sample-prediction with the L1 training objective. We empirically show that the multi-modality can be expressed via a single ODE step. Thus, we propose L1 Flow, a two-step sampling schedule that generates a suboptimal action sequence via a single integration step and then reconstructs the precise action sequence through a single prediction. The proposed method largely retains the advantages of flow matching while reducing the iterative neural function evaluations to merely two and mitigating the potential performance degradation associated with direct sample regression. We evaluate our method with varying baselines and benchmarks, including 8 tasks in MimicGen, 5 tasks in RoboMimic & PushT Bench, and one task in the real-world scenario. The results show the advantages of the proposed method with regard to training efficiency, inference speed, and overall performance. Project website: https://song-wx.github.io/l1flow.github.io/
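The two-step sampling schedule described in the abstract could be sketched roughly as follows; the intermediate time point and the network signature below are guesses based on the description above, not the paper's exact schedule.

```python
import torch

@torch.no_grad()
def l1_flow_two_step_sample(policy_net, obs, action_shape, device="cpu"):
    """Two-step inference sketch: one Euler ODE step, then one refinement.

    action_shape: (batch, horizon, action_dim). The intermediate time t_mid
    and the network signature policy_net(x_t, t, obs) are illustrative guesses.
    """
    x0 = torch.randn(action_shape, device=device)        # start from noise (t = 0)
    t0 = torch.zeros(action_shape[0], 1, 1, device=device)

    # Stage 1: predict the clean sample once; with a linear path, the implied
    # velocity is (x1_hat - x0), so one Euler step yields a coarse trajectory.
    x1_hat = policy_net(x0, t0, obs)
    t_mid = 0.9                                           # assumed intermediate time
    x_mid = x0 + t_mid * (x1_hat - x0)

    # Stage 2: one high-fidelity prediction from the partially denoised state.
    actions = policy_net(x_mid, torch.full_like(t0, t_mid), obs)
    return actions
```

Compared with a full ODE solve over many steps, this uses only two forward passes, which is where the abstract's claim of reducing neural function evaluations to two comes from.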
Problem

Research questions and friction points this paper is trying to address.

Combining denoising models' multi-modal capture with L1 regression efficiency
Reducing flow matching evaluations to two steps for speed
Enhancing robotic manipulation training and inference performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates v-prediction flow matching into sample-prediction (written out after this list)
Uses L1 training objective for efficient visuomotor learning
Proposes two-step sampling schedule with single ODE step
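For reference, the reformulation named in the first item above can be written out in standard rectified-flow notation; the linear interpolation path and the symbols here are my assumption, not necessarily the paper's exact formulation.

```latex
% Assumed rectified-flow interpolation between noise x_0 and data x_1:
%   x_t = (1 - t) x_0 + t x_1,  with target velocity  v = x_1 - x_0.
% v-prediction trains v_theta(x_t, t) to match v; sample prediction instead
% trains \hat{x}_theta(x_t, t) to match x_1 with an L1 loss. The two are related by:
\[
  v_\theta(x_t, t) = \frac{\hat{x}_\theta(x_t, t) - x_t}{1 - t},
  \qquad
  \mathcal{L}_{\mathrm{L1}} = \mathbb{E}\left\lVert \hat{x}_\theta(x_t, t) - x_1 \right\rVert_1 .
\]
```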
๐Ÿ”Ž Similar Papers
No similar papers found.
Weixi Song
Westlake University & Zhejiang University & Shanghai Innovation Institute
Machine Learning
Zhetao Chen
Zhejiang University
Tao Xu
Shanghai Innovation Institute
Xianchao Zeng
Shanghai Innovation Institute
Xinyu Zhou
Shanghai Innovation Institute
Lixin Yang
Shanghai Innovation Institute
Donglin Wang
Westlake University
Cewu Lu
Shanghai Jiao Tong University
Yong-Lu Li
Associate Professor, Shanghai Jiao Tong University/Shanghai Innovation Institute
Physical Reasoning, Robotics, Computer Vision, Machine Learning, Embodied AI