🤖 AI Summary
Flow Matching (FM) achieves strong performance in generative tasks such as robotic manipulation, yet suffers from training-inference inconsistency: generation quality cannot be assessed during training, and the strong bias toward predefined linear trajectories induces rigidity and instability. This work establishes, for the first time, a theoretical connection between FM's training loss and inference error. We propose a maximum-likelihood-based reconstruction fine-tuning framework that jointly incorporates a residual architecture and a contraction constraint to enhance both robustness and interpretability. Our method supports two fine-tuning strategies, direct and residual, and integrates seamlessly into FM-driven ordinary differential equation (ODE) solvers. Evaluated on image generation and real-world robotic manipulation tasks, it significantly improves inference accuracy and stability. Experiments demonstrate the method's effectiveness, generalizability, and engineering practicality.
📝 Abstract
The Flow Matching (FM) algorithm achieves remarkable results in generative tasks, especially robotic manipulation. Building on the foundations of diffusion models, FM's simulation-free paradigm enables simple and efficient training, but inherently introduces a train-inference gap: the model's output cannot be assessed during the training phase. In contrast, other generative models, including Variational Autoencoders (VAEs), Normalizing Flows, and Generative Adversarial Networks (GANs), directly optimize a reconstruction loss. This gap is particularly evident in scenarios that demand high precision, such as robotic manipulation. Moreover, we show that FM's over-pursuit of straight predefined paths can introduce serious problems such as stiffness into the system. These observations motivate us to fine-tune FM via Maximum Likelihood Estimation of reconstructions, an approach made feasible by FM's underlying smooth ordinary differential equation (ODE) formulation, in contrast to the stochastic differential equations (SDEs) used in diffusion models. This paper first theoretically analyzes the relation between training loss and inference error in FM. We then propose a method for fine-tuning FM via Maximum Likelihood Estimation of reconstructions, comprising both a straightforward fine-tuning approach and a residual-based one. Furthermore, through specifically designed architectures, the residual-based fine-tuning incorporates a contraction property into the model, which is crucial for robustness and interpretability. Experimental results in image generation and robotic manipulation verify that our method reliably improves the inference performance of FM.
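To make the train-inference gap concrete, here is a minimal NumPy sketch (an illustration under standard FM assumptions, not the paper's implementation): the simulation-free training loss regresses a velocity field against the constant target `x1 - x0` on a linear interpolation path, while inference instead rolls out the learned ODE, a computation the training loss never touches. The `v_star` field below is a hypothetical perfectly trained model for a single data pair, introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fm_training_loss(model_v, x0, x1, t):
    """Simulation-free FM loss on a linear path x_t = (1-t) x0 + t x1.
    The regression target is the constant velocity x1 - x0; no ODE is solved."""
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    return np.mean((model_v(xt, t) - target) ** 2)

def euler_reconstruct(model_v, x0, n_steps=100):
    """Inference: integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps.
    Training never evaluates this rollout -- hence the train-inference gap."""
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * model_v(x, k * dt)
    return x

# Toy "perfectly trained" field for one fixed pair (x0, x1): v(x, t) = x1 - x0.
x0 = rng.normal(size=3)
x1 = rng.normal(size=3)
v_star = lambda x, t: x1 - x0

loss = fm_training_loss(v_star, x0, x1, t=0.5)  # zero: target matched exactly
recon = euler_reconstruct(v_star, x0)           # Euler rollout lands on x1
```

In this idealized case zero training loss implies exact reconstruction, but for an imperfect field the rollout accumulates error that the pointwise loss does not measure; fine-tuning directly on the reconstruction, as the paper proposes, closes that loop.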