🤖 AI Summary
This work proposes Self-Flow, a self-supervised flow matching framework that addresses the limitations of existing generative models—such as objective misalignment, disjoint training procedures, and anomalous scaling behavior—by integrating representation learning directly into the generative process. The key innovation lies in a Dual-Timestep Scheduling mechanism, which applies heterogeneous noise to different tokens to create information asymmetry, compelling the model to infer missing content from corrupted inputs. This joint optimization of semantic representations and generation capabilities eliminates the need for external pretraining or supervision. Self-Flow achieves state-of-the-art performance across image, video, and audio generation tasks, demonstrates strong cross-modal generalization, and exhibits favorable scaling properties.
📝 Abstract
Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on external models, which require separate training, operate on misaligned objectives, and exhibit unexpected scaling behavior. We argue that this dependence arises from the model's training objective, which poses a denoising task with little incentive to learn semantic representations. We introduce Self-Flow: a self-supervised flow matching paradigm that integrates representation learning within the generative framework. Our key mechanism, Dual-Timestep Scheduling, applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This drives the model to learn strong representations alongside its generative capabilities, without external supervision. Our method generalizes across modalities and enables multi-modal training while following expected scaling laws, achieving superior image, video, and audio generation.
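To make the Dual-Timestep Scheduling idea concrete, here is a minimal sketch of how heterogeneous per-token noise levels might be applied along a linear flow-matching path. This is an illustration under stated assumptions, not the paper's implementation: the function name `dual_timestep_interpolate`, the even random token split, and the choice of two timesteps per sample are all hypothetical.

```python
import numpy as np

def dual_timestep_interpolate(x1, rng, split=0.5):
    """Hypothetical sketch of Dual-Timestep Scheduling for flow matching.

    Each token is linearly interpolated between Gaussian noise x0 and
    data x1, but two different timesteps are drawn per sample: one subset
    of tokens is corrupted more heavily than the other, creating the
    information asymmetry described in the abstract.
    """
    n_tokens, dim = x1.shape
    x0 = rng.standard_normal((n_tokens, dim))      # noise endpoint of the path
    t_a, t_b = np.sort(rng.uniform(size=2))        # two timesteps, t_a <= t_b
    mask = rng.uniform(size=n_tokens) < split      # random token assignment
    t = np.where(mask, t_a, t_b)[:, None]          # per-token timestep
    xt = (1.0 - t) * x0 + t * x1                   # linear flow-matching interpolant
    target = x1 - x0                               # velocity target for the flow model
    return xt, t.squeeze(-1), target

# Usage: 8 tokens of dimension 4; tokens at t_a see more noise than those at t_b.
rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 4))
xt, t, target = dual_timestep_interpolate(x1, rng)
```

The velocity target `x1 - x0` follows the standard linear (rectified) flow-matching formulation; the model would be trained to predict it from the mixed-noise input `xt`, so reconstructing the heavily corrupted tokens requires inferring semantics from the lightly corrupted ones.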