Training Flow Matching Models with Reliable Labels via Self-Purification

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Label noise, such as human annotation error, severely degrades model performance. To address this, the authors propose a self-purifying training framework tailored to Flow Matching (FM) that requires neither pretraining nor auxiliary modules. During training, it dynamically estimates the credibility of each sample's label and filters unreliable instances. The core idea is to embed this self-purification mechanism directly into FM's continuous-time training, enabling noise-aware gradient updates and adaptive loss weighting based on estimated label reliability. Experiments on TITW, a real-world noisy speech dataset, show that the method markedly improves how faithfully generated samples adhere to their conditioning information and that it outperforms existing baselines in robustness to label noise.

📝 Abstract
Training datasets are inherently imperfect, often containing mislabeled samples due to human annotation errors, limitations of tagging models, and other sources of noise. Such label contamination can significantly degrade the performance of a trained model. In this work, we introduce Self-Purifying Flow Matching (SPFM), a principled approach to filtering unreliable data within the flow-matching framework. SPFM identifies suspicious data using the model itself during the training process, bypassing the need for pretrained models or additional modules. Our experiments demonstrate that models trained with SPFM generate samples that accurately adhere to the specified conditioning, even when trained on noisy labels. Furthermore, we validate the robustness of SPFM on the TITW dataset, which consists of in-the-wild speech data, achieving performance that surpasses existing baselines.
Problem

Research questions and friction points this paper is trying to address.

Filtering unreliable data in flow-matching models with noisy labels
Improving model performance on imperfect training datasets with mislabeled samples
Generating accurate conditioned samples without requiring pretrained models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Purifying Flow Matching filters unreliable data
Identifies suspicious data using the model itself
Bypasses need for pretrained models or modules
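The Innovation bullets can be illustrated with a minimal PyTorch sketch of one training step. Everything here is an assumption for illustration: the names (`VelocityNet`, `spfm_style_step`), the toy architecture, and in particular the credibility estimate (a small-loss softmax weighting with temperature `tau`) stand in for SPFM's actual reliability estimator, which this summary does not specify.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: not the paper's code. The reliability weight below
# (softmax over negative per-sample losses) is a stand-in for SPFM's
# credibility estimate; only the flow-matching loss itself is standard.

class VelocityNet(nn.Module):
    """Toy conditional velocity field v_theta(x_t, t, y)."""
    def __init__(self, dim=2, n_classes=4):
        super().__init__()
        self.emb = nn.Embedding(n_classes, 8)
        self.net = nn.Sequential(
            nn.Linear(dim + 1 + 8, 64), nn.SiLU(), nn.Linear(64, dim)
        )

    def forward(self, x, t, y):
        return self.net(torch.cat([x, t, self.emb(y)], dim=-1))

def spfm_style_step(model, opt, x1, y, tau=1.0):
    """One FM training step with per-sample reliability weighting (sketch)."""
    x0 = torch.randn_like(x1)            # source noise sample
    t = torch.rand(x1.size(0), 1)        # continuous time in [0, 1]
    xt = (1 - t) * x0 + t * x1           # linear interpolation path
    target = x1 - x0                     # flow-matching target velocity
    per_sample = ((model(xt, t, y) - target) ** 2).mean(dim=-1)
    # Credibility weights: down-weight high-loss samples, which are more
    # likely to carry noisy labels (small-loss heuristic, normalized so
    # the weights average to 1 over the batch).
    w = torch.softmax(-per_sample.detach() / tau, dim=0) * x1.size(0)
    loss = (w * per_sample).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

A hard filtering variant, as the bullets describe, would instead drop samples whose weight falls below a threshold rather than soft-weighting them; either way the model itself supplies the reliability signal, with no pretrained network or auxiliary module involved.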