🤖 AI Summary
This work addresses the degraded editing quality of few-step image editing, which stems from poor approximation of the forward process, and the limited generalization of existing few-step inversion methods that rely on pretrained generators and auxiliary modules. The authors propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns image generation and inversion within a single model by directly estimating average velocity fields in both the image-to-noise and noise-to-image directions. Training uses continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation enables one-step inversion without auxiliary modules and integrates with common diffusion and flow matching backbones. Experiments show that BiFM consistently outperforms existing few-step approaches across image editing and generation tasks, with better image quality and editability.
📝 Abstract
Recent diffusion and flow matching models have demonstrated strong capabilities in image generation and editing by progressively removing noise through iterative sampling. While this enables flexible inversion for semantic-preserving edits, few-step sampling regimes suffer from poor forward process approximation, leading to degraded editing quality. Existing few-step inversion methods often rely on pretrained generators and auxiliary modules, limiting scalability and generalization across different architectures. To address these limitations, we propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns generation and inversion within a single model. BiFM directly estimates average velocity fields in both ``image $\to$ noise'' and ``noise $\to$ image'' directions, constrained by a shared instantaneous velocity field derived from either predefined schedules or pretrained multi-step diffusion models. Additionally, BiFM introduces a novel training strategy using continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation also enables one-step inversion and can integrate seamlessly into popular diffusion and flow matching backbones. Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step approaches, achieving superior performance and editability.
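
To make the idea of an average velocity field concrete, the following toy sketch shows how a single average velocity can carry a sample across a whole time interval in one step, in either direction. It is an illustration under stated assumptions, not the BiFM implementation: it assumes a straight-line schedule $z_t = (1 - t)\,x + t\,\epsilon$, uses an oracle velocity in place of a learned network, and all function names (`interpolate`, `average_velocity`, `jump`) are hypothetical.

```python
import random


def interpolate(x, eps, t):
    """Point on the assumed straight path z_t = (1 - t) * x + t * eps."""
    return [(1 - t) * xi + t * ei for xi, ei in zip(x, eps)]


def average_velocity(x, eps):
    """Oracle average velocity for the assumed straight path.

    The instantaneous velocity dz/dt = eps - x is constant along this path,
    so its time average over any interval [r, t] is the same vector.
    A trained model would predict this from (z, r, t) instead.
    """
    return [ei - xi for xi, ei in zip(x, eps)]


def jump(z, u, t_from, t_to):
    """Move from time t_from to time t_to in one step using average velocity u."""
    return [zi + (t_to - t_from) * ui for zi, ui in zip(z, u)]


random.seed(0)
x = [random.gauss(0, 1) for _ in range(4)]    # stand-in "image" (t = 0)
eps = [random.gauss(0, 1) for _ in range(4)]  # stand-in "noise" (t = 1)
u = average_velocity(x, eps)

# image -> noise (inversion) in a single step
noise_hat = jump(x, u, 0.0, 1.0)
# noise -> image (generation) in a single step, reusing the same velocity
image_hat = jump(noise_hat, u, 1.0, 0.0)

# Round-trip error: how far the regenerated image is from the original
roundtrip_error = max(abs(a - b) for a, b in zip(image_hat, x))
# The midpoint of the path is also reachable by a half-length jump
midpoint_error = max(
    abs(a - b) for a, b in zip(jump(x, u, 0.0, 0.5), interpolate(x, eps, 0.5))
)
```

In this idealized setting the round trip is consistent up to floating-point error because both directions share one velocity. With a learned model the two directions would only agree approximately, which is presumably the kind of mismatch a bidirectional consistency objective is meant to penalize.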