🤖 AI Summary
Multimodal Intent Recognition (MMIR) commonly suffers from weak cross-modal semantic alignment and insufficient robustness under noisy conditions and rare-class scenarios. To address these challenges, we propose a prototype-guided contrastive alignment framework coupled with a coarse-to-fine dynamic attention fusion mechanism. First, class-level prototype representations are constructed to drive cross-modal contrastive learning, thereby enhancing semantic consistency across modalities. Second, a hierarchical attention mechanism is designed to jointly model global intent summaries and fine-grained token-level features, improving both noise robustness and generalization to rare classes. By integrating prototype representation learning, contrastive learning, and dynamic multimodal feature fusion, our method achieves state-of-the-art performance on MIntRec and MIntRec2.0, with rare-class weighted F1 scores improved by 1.05% and 4.18%, respectively, demonstrating significantly enhanced reliability under low-resource and noisy conditions.
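The prototype-guided contrastive alignment described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; it assumes prototypes are class-mean embeddings and uses an InfoNCE-style loss over instance-to-prototype similarities, with all function names chosen for illustration:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize vectors so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def class_prototypes(embeddings, labels, num_classes):
    # Illustrative prototype construction: mean embedding per class, L2-normalized.
    protos = np.stack([embeddings[labels == c].mean(axis=0)
                       for c in range(num_classes)])
    return l2_normalize(protos)

def prototype_contrastive_loss(embeddings, labels, prototypes, temperature=0.1):
    # InfoNCE-style objective: pull each instance toward its own class prototype,
    # push it away from the prototypes of other classes.
    z = l2_normalize(embeddings)
    logits = z @ prototypes.T / temperature            # (N, C) similarity scores
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

On well-separated clusters this loss is near zero, and it grows when instances sit far from their class prototype, which is the semantic-consistency pressure the summary refers to.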
📝 Abstract
Multimodal intent recognition (MMIR) suffers from weak semantic grounding and poor robustness under noisy or rare-class conditions. We propose MVCL-DAF++, which extends MVCL-DAF with two key modules: (1) Prototype-aware contrastive alignment, aligning instances to class-level prototypes to enhance semantic consistency; and (2) Coarse-to-fine attention fusion, integrating global modality summaries with token-level features for hierarchical cross-modal interaction. On MIntRec and MIntRec2.0, MVCL-DAF++ achieves new state-of-the-art results, improving rare-class recognition by +1.05% and +4.18% WF1, respectively. These results demonstrate the effectiveness of prototype-guided learning and coarse-to-fine fusion for robust multimodal understanding. The source code is available at https://github.com/chr1s623/MVCL-DAF-PlusPlus.