🤖 AI Summary
To address the significant performance degradation that multimodal visual recognition models suffer when modalities are missing, this paper proposes SyP, a synergistic prompting framework. SyP dynamically generates adaptable prompts via learnable adapters and jointly integrates static and dynamic prompts to enable adaptive modeling of incomplete inputs. It further introduces multimodal feature scaling and hybrid prompt fusion to balance cross-modal information during end-to-end training. SyP overcomes two key limitations of conventional static prompting: inflexibility in handling heterogeneous missing patterns, and unstable prompt tuning when critical modalities are absent. Extensive experiments on three mainstream visual recognition benchmarks demonstrate that SyP consistently outperforms state-of-the-art methods across diverse missing rates and missing patterns, achieving superior robustness and generalization.
📝 Abstract
Large-scale multi-modal models have demonstrated remarkable performance across various visual recognition tasks by leveraging extensive paired multi-modal training data. However, in real-world applications, missing or incomplete modality inputs often lead to significant performance degradation. Recent research has focused on prompt-based strategies to tackle this issue; however, existing methods are hindered by two major limitations: (1) static prompts lack the flexibility to adapt to varying missing-data conditions, and (2) basic prompt-tuning methods struggle to ensure reliable performance when critical modalities are missing. To address these challenges, we propose a novel Synergistic Prompting (SyP) framework for robust visual recognition with missing modalities. The proposed SyP introduces two key innovations: (I) a Dynamic Adapter, which computes adaptive scaling factors to dynamically generate prompts, replacing static parameters for flexible multi-modal adaptation, and (II) a Synergistic Prompting Strategy, which combines static and dynamic prompts to balance information across modalities, ensuring robust reasoning even when key modalities are missing. SyP achieves significant performance improvements over existing approaches across three widely used visual recognition datasets, demonstrating robustness under diverse missing rates and conditions. Extensive experiments and ablation studies validate its effectiveness in handling missing modalities, highlighting its superior adaptability and reliability.
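The two components described above can be sketched in minimal form. This is an illustrative assumption of how the mechanism could work, not the paper's implementation: all function names, the sigmoid-based scaling, and the additive fusion rule are hypothetical stand-ins for the Dynamic Adapter and the Synergistic Prompting Strategy.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dynamic_adapter(features, weights):
    """Hypothetical Dynamic Adapter: derive an adaptive scaling factor
    from the available input features via a learned projection
    followed by a sigmoid (an assumed, not confirmed, mechanism)."""
    score = sum(f * w for f, w in zip(features, weights))
    return sigmoid(score)

def synergistic_prompt(static_prompt, base_prompt, features, weights):
    """Hypothetical Synergistic Prompting: scale a base prompt with the
    adapter's factor, then fuse it with a static (learned) prompt.
    Element-wise addition is an illustrative fusion choice."""
    scale = dynamic_adapter(features, weights)
    dynamic_prompt = [scale * p for p in base_prompt]
    return [s + d for s, d in zip(static_prompt, dynamic_prompt)]

# With a missing modality represented as zeroed features, the adapter
# outputs a neutral scale (sigmoid(0) = 0.5), so the static prompt
# still contributes a stable signal.
prompt = synergistic_prompt([0.2, 0.4], [1.0, 1.0], [0.0, 0.0], [0.5, 0.5])
```

In this toy setting the fused prompt degrades gracefully: the dynamic component shrinks toward a neutral contribution when a modality's features are absent, while the static prompt preserves a baseline, mirroring the robustness argument the abstract makes.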