🤖 AI Summary
Flow Matching (FM) achieves strong performance in generative tasks such as robotic manipulation, yet suffers from training-inference inconsistency: generation quality cannot be assessed during training, and the strong bias toward predefined linear trajectories induces rigidity and instability. This work establishes, for the first time, a theoretical connection between FM's training loss and inference error. We propose a maximum-likelihood-based reconstruction fine-tuning framework that jointly incorporates a residual architecture and a contraction constraint to enhance both robustness and interpretability. Our method supports two fine-tuning strategies, direct and residual, and integrates seamlessly into FM-driven ordinary differential equation (ODE) solvers. Evaluated on image generation and real-world robotic manipulation tasks, it significantly improves inference accuracy and stability. Experiments demonstrate the method's effectiveness, generalizability, and engineering practicality.
📝 Abstract
The Flow Matching (FM) algorithm achieves remarkable results in generative tasks, especially robotic manipulation. Building on the foundations of diffusion models, FM's simulation-free paradigm enables simple and efficient training, but inherently introduces a train-inference gap: the model's output cannot be assessed during the training phase. In contrast, other generative models, including Variational Autoencoders (VAEs), Normalizing Flows, and Generative Adversarial Networks (GANs), directly optimize a reconstruction loss. This gap is particularly evident in scenarios that demand high precision, such as robotic manipulation. Moreover, we show that FM's over-pursuit of straight predefined paths can introduce serious problems such as stiffness into the system. These observations motivate us to fine-tune FM via Maximum Likelihood Estimation of reconstructions, an approach made feasible by FM's underlying smooth ordinary differential equation (ODE) formulation, in contrast to the stochastic differential equations (SDEs) used in diffusion models. This paper first theoretically analyzes the relation between training loss and inference error in FM. We then propose a method for fine-tuning FM via Maximum Likelihood Estimation of reconstructions, comprising both a straightforward fine-tuning approach and a residual-based one. Furthermore, through specifically designed architectures, the residual-based fine-tuning incorporates a contraction property into the model, which is crucial for robustness and interpretability. Experimental results in image generation and robotic manipulation verify that our method reliably improves the inference performance of FM.
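To make the train-inference gap concrete, here is a minimal NumPy sketch (an illustration under standard FM assumptions, not the paper's implementation): the simulation-free training loss regresses a velocity field against the constant target `x1 - x0` on a linear interpolation path, while inference instead rolls out the learned ODE, a computation the training loss never touches. The `v_star` field below is a hypothetical perfectly trained model for a single data pair, introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fm_training_loss(model_v, x0, x1, t):
    """Simulation-free FM loss on a linear path x_t = (1-t) x0 + t x1.
    The regression target is the constant velocity x1 - x0; no ODE is solved."""
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    return np.mean((model_v(xt, t) - target) ** 2)

def euler_reconstruct(model_v, x0, n_steps=100):
    """Inference: integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps.
    Training never evaluates this rollout -- hence the train-inference gap."""
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * model_v(x, k * dt)
    return x

# Toy "perfectly trained" field for one fixed pair (x0, x1): v(x, t) = x1 - x0.
x0 = rng.normal(size=3)
x1 = rng.normal(size=3)
v_star = lambda x, t: x1 - x0

loss = fm_training_loss(v_star, x0, x1, t=0.5)  # zero: target matched exactly
recon = euler_reconstruct(v_star, x0)           # Euler rollout lands on x1
```

In this idealized case zero training loss implies exact reconstruction, but for an imperfect field the rollout accumulates error that the pointwise loss does not measure; fine-tuning directly on the reconstruction, as the paper proposes, closes that loop.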