🤖 AI Summary
Flow matching (FM) can accumulate errors along the integration trajectories of the learned velocity field, causing generated samples to drift off the data manifold. The resulting loss in sample quality is most pronounced under low-step sampling or with lightweight models. To address this, we propose Velocity Contrastive Regularization (VeCoR), an attract-repel training scheme that supervises the velocity field from both sides: it aligns the predicted velocity with a stable reference direction (positive supervision) and pushes it away from inconsistent, off-manifold directions (negative supervision). This turns the conventional one-sided attraction objective into a two-sided constraint that regularizes trajectory evolution. On ImageNet-1K 256×256, VeCoR yields relative FID reductions of 22% on SiT-XL/2 and 35% on REPA-SiT-XL/2; on MS-COCO text-to-image generation it yields a 32% relative FID gain, alongside improved stability and convergence. The key contribution is bringing contrastive, two-sided supervision into FM velocity-field training.
📝 Abstract
Flow Matching (FM) has recently emerged as a principled and efficient alternative to diffusion models. Standard FM encourages the learned velocity field to follow a target direction; however, it may accumulate errors along the trajectory and drive samples off the data manifold, leading to perceptual degradation, especially in lightweight or low-step configurations.
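For reference, the one-sided objective described above can be written as follows. This is a sketch under a stated assumption: it uses the common linear interpolation path between a noise sample $x_0$ and a data sample $x_1$, and the symbols $x_0$, $x_1$, $x_t$, and $v_\theta$ are notation introduced here rather than taken from the abstract.

$$x_t = (1-t)\,x_0 + t\,x_1, \qquad \mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\,\big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2$$

Under this form the loss only pulls $v_\theta$ toward the target direction $x_1 - x_0$. Sampling then integrates $v_\theta$ over a finite number of steps, so any residual error in the velocity is carried forward from one step to the next.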
To enhance stability and generalization, we extend FM into a balanced attract-repel scheme that provides explicit guidance on both "where to go" and "where not to go." Formally, we propose \textbf{Velocity Contrastive Regularization (VeCoR)}, a complementary training scheme for flow-based generative modeling that augments the standard FM objective with contrastive, two-sided supervision. VeCoR not only aligns the predicted velocity with a stable reference direction (positive supervision) but also pushes it away from inconsistent, off-manifold directions (negative supervision). This contrastive formulation transforms FM from a purely attractive, one-sided objective into a two-sided training signal, regularizing trajectory evolution and improving perceptual fidelity across datasets and backbones.
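The attract-repel idea can be illustrated with a minimal pure-Python sketch. It is not the VeCoR loss itself, since the abstract does not give its exact form. The sketch assumes a squared-error attraction term, a squared-error repulsion term scaled by a hypothetical weight `lam`, and negatives taken from the target velocity of another sample in the batch. The names `sq_dist`, `attract_repel_loss`, and `batch_loss` are illustrative only.

```python
def sq_dist(a, b):
    """Mean squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def attract_repel_loss(v_pred, v_pos, v_neg, lam=0.05):
    """Pull v_pred toward v_pos and push it away from v_neg.

    lam is a hypothetical weight on the repulsion term (an assumption,
    not a value from the paper); lam=0 recovers the one-sided loss.
    """
    attract = sq_dist(v_pred, v_pos)  # "where to go"
    repel = sq_dist(v_pred, v_neg)    # "where not to go"
    return attract - lam * repel


def batch_loss(preds, targets, lam=0.05):
    """Average attract-repel loss over a batch.

    Assumption: the negative for sample i is the target velocity of the
    next sample in the batch (a simple roll by one position).
    """
    n = len(preds)
    if n < 2:
        raise ValueError("need at least two samples to form negatives")
    total = 0.0
    for i in range(n):
        neg = targets[(i + 1) % n]
        total += attract_repel_loss(preds[i], targets[i], neg, lam)
    return total / n
```

With `lam=0` the function reduces to plain regression toward the positive target. Because the repulsion term is unbounded in this sketch, `lam` would need to stay small so that it does not dominate the attraction term.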
On ImageNet-1K 256$\times$256, VeCoR yields 22% and 35% relative FID reductions on SiT-XL/2 and REPA-SiT-XL/2 backbones, respectively, and achieves further FID gains (32% relative) on MS-COCO text-to-image generation, demonstrating consistent improvements in stability, convergence, and image quality, particularly in low-step and lightweight settings. Project page: https://p458732.github.io/VeCoR_Project_Page/