🤖 AI Summary
Existing self-supervised learning methods employ random masking, which disregards the intrinsic spatiotemporal (for video) and spectral (for audio) structure of multimodal signals, limiting representation quality. To address this, we propose a physics-aware structured noise masking approach: colored noise masks, tailored to each modality's physical characteristics, are generated by filtering white noise, enabling fully data-free, prior-free, and modality-adaptive mask design. This is the first work to introduce physically interpretable structured noise into masking-based representation learning, preserving the structural constraints of the original signal. Extensive experiments on self-supervised masked video and audio modeling demonstrate that our method consistently and substantially outperforms random masking baselines on downstream tasks, without incurring any additional computational overhead.
📝 Abstract
Masked modeling has emerged as a powerful self-supervised learning framework, but existing methods largely rely on random masking, disregarding the structural properties of different modalities. In this work, we introduce structured noise-based masking, a simple yet effective approach that naturally aligns with the spatial, temporal, and spectral characteristics of video and audio data. By filtering white noise into distinct colored noise distributions, we generate structured masks that preserve modality-specific patterns without requiring handcrafted heuristics or access to the data. Our approach improves the performance of masked video and audio modeling frameworks without any computational overhead. Extensive experiments demonstrate that structured noise masking achieves consistent improvements over random masking for standard and advanced masked modeling methods, highlighting the importance of modality-aware masking strategies for representation learning.
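To make the idea concrete, the sketch below illustrates one common way to turn filtered white noise into a binary token mask: shape the noise spectrum with a 1/f^β (colored noise) envelope in the Fourier domain, then threshold at the target mask ratio. The function name `colored_noise_mask`, the 8×14×14 token grid, and the exponent values are illustrative assumptions, not the paper's reported design.

```python
import numpy as np

def colored_noise_mask(shape=(8, 14, 14), beta=1.0, mask_ratio=0.9, rng=None):
    """Illustrative sketch: build a binary token mask by filtering white noise
    into colored (1/f^beta) noise and thresholding at the target mask ratio.

    shape      -- token grid (e.g. temporal x height x width for video);
                  grid size and exponent here are assumptions for illustration.
    beta       -- spectral exponent (0 = white, 1 = pink, 2 = brown/red noise).
    mask_ratio -- fraction of tokens to mask.
    """
    rng = np.random.default_rng() if rng is None else rng

    # 1) Sample white noise over the token grid.
    white = rng.standard_normal(shape)

    # 2) Build a radial frequency grid and a 1/f^beta amplitude envelope.
    freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing="ij")
    radius = np.sqrt(sum(f ** 2 for f in freqs))
    radius[tuple(0 for _ in shape)] = np.inf        # avoid division by zero at DC
    envelope = radius ** (-beta / 2.0)              # power spectrum ~ 1/f^beta

    # 3) Filter: shape the white-noise spectrum, then return to the signal domain.
    colored = np.fft.ifftn(np.fft.fftn(white) * envelope).real

    # 4) Threshold at the matching quantile to get a binary mask (True = masked).
    threshold = np.quantile(colored, 1.0 - mask_ratio)
    return colored >= threshold

# Example: a pink-noise mask over an 8x14x14 video token grid, ~90% masked.
mask = colored_noise_mask()
print(mask.shape, mask.mean())  # -> (8, 14, 14), ~0.9
```

Because the spectral envelope is the only modality-dependent choice, the same procedure can be reused for audio spectrogram tokens by picking a different noise color, which is what makes the approach data-free and incurs no extra training cost.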