🤖 AI Summary
Existing multimodal pretraining approaches typically treat all modalities equally, often failing to adequately optimize representations of the modality most critical for downstream tasks. This work proposes a model-agnostic pretraining strategy that, during masked modeling, increases the masking difficulty, loss weight, and decoder capacity for the target modality, thereby steering representation learning toward the modality required by downstream applications. Notably, this method introduces an explicit modality bias during pretraining without modifying the shared encoder or requiring additional supervision. Evaluated on wireless-signal constellation diagram tasks, the approach achieves significant improvements in downstream fine-tuning performance using only existing data and compute, demonstrating its effectiveness and practicality.
📝 Abstract
Multimodal pretraining is effective for building general-purpose representations, but in many practical deployments only one modality is heavily used during downstream fine-tuning. Standard pretraining strategies treat all modalities uniformly, which can leave the modality that actually matters under-optimized. We propose Finetune-Informed Pretraining (FIP), a model-agnostic method that biases representation learning toward a designated target modality needed at fine-tuning time. FIP combines harder masking, stronger loss weighting, and increased decoder capacity for the target modality, without modifying the shared encoder or requiring additional supervision. When applied to masked modeling on constellation diagrams for wireless signals, FIP consistently improves downstream fine-tuning performance with no extra data or compute. FIP is simple to implement, architecture-compatible, and broadly applicable across multimodal masked modeling pipelines.
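To make the three levers concrete, here is a minimal sketch of how a per-modality configuration could drive a masked-modeling loss. All names, modalities, and values are illustrative assumptions, not taken from the paper; the paper's actual models and hyperparameters are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality settings illustrating FIP's three levers:
# a higher mask ratio (harder masking), a larger loss weight, and a
# deeper decoder for the target modality. Values are placeholders.
FIP_CONFIG = {
    # target modality for downstream fine-tuning
    "constellation": {"mask_ratio": 0.85, "loss_weight": 2.0, "decoder_depth": 8},
    # auxiliary modality, pretrained with default settings
    "spectrogram":   {"mask_ratio": 0.60, "loss_weight": 1.0, "decoder_depth": 2},
}

def masked_recon_loss(tokens, recon, mask_ratio, rng):
    """Mean squared reconstruction error on a randomly masked token subset."""
    n = tokens.shape[0]
    n_mask = max(1, int(round(mask_ratio * n)))
    idx = rng.choice(n, size=n_mask, replace=False)
    return float(np.mean((tokens[idx] - recon[idx]) ** 2))

def fip_total_loss(batch, recons, config, rng):
    """Sum per-modality masked losses, reweighted toward the target modality."""
    total = 0.0
    for name, cfg in config.items():
        loss = masked_recon_loss(batch[name], recons[name], cfg["mask_ratio"], rng)
        total += cfg["loss_weight"] * loss
    return total

# Toy usage: 64 tokens of dimension 16 per modality, with noisy "reconstructions".
batch  = {m: rng.normal(size=(64, 16)) for m in FIP_CONFIG}
recons = {m: batch[m] + 0.1 * rng.normal(size=(64, 16)) for m in FIP_CONFIG}
total = fip_total_loss(batch, recons, FIP_CONFIG, rng)
```

The decoder-capacity lever is represented here only by the `decoder_depth` field; in a full pipeline it would size the per-modality decoder, while the shared encoder stays untouched, matching the paper's claim that FIP requires no encoder changes.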