🤖 AI Summary
Existing approaches to multimodal dialogue understanding rarely model the dependency between the dialogue context and the current utterance explicitly. This work proposes CUCI-Net, which abstracts that context-utterance dependency into an explicit interpretation cue. The model keeps context and utterance representations structurally distinct during encoding, then combines local modality-specific evidence with global contextual evidence to form the cue. The cue is then fed into the final multimodal interaction stage, so the prediction is conditioned on context. Experiments on mainstream multimodal dialogue benchmarks support the effectiveness of the method.
📝 Abstract
Conversational multimodal understanding aims to infer the meaning or label of the current utterance from its preceding dialogue context together with textual, acoustic, and visual signals. Existing methods mainly strengthen contextual modeling through enhanced encoding, fusion, or propagation, but rarely abstract the context-utterance dependency into an explicit cue and incorporate it into later multimodal reasoning. To address this issue, we propose CUCI-Net for conversational multimodal understanding. CUCI-Net preserves the structural distinction between context and utterance during encoding, abstracts their dependency into an interpretation cue by combining local modality evidence with global contextual evidence, and integrates the resulting cue into the final multimodal interaction stage for context-conditioned prediction. Experiments on mainstream benchmark datasets demonstrate the effectiveness of the proposed method.
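The three stages named in the abstract (separate context and utterance representations, a cue built from local and global evidence, cue-guided interaction) can be illustrated with a toy sketch. This is not the CUCI-Net implementation: the abstract does not specify the operators, so the dot-product attention, mean pooling, `tanh` combination, and the function names `build_cue` and `cue_guided_fusion` are all assumptions chosen only to make the data flow concrete.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def mean(vectors):
    return weighted_sum([1.0 / len(vectors)] * len(vectors), vectors)

def build_cue(context, utterance):
    """Hypothetical cue: context is {modality: [vector, ...]},
    utterance is {modality: vector}; the two are kept separate."""
    # Local modality evidence (assumed form): the utterance attends over
    # the context turns of its own modality.
    local = []
    for modality, u in utterance.items():
        attn = softmax([dot(u, c) for c in context[modality]])
        local.append(weighted_sum(attn, context[modality]))
    local_evidence = mean(local)
    # Global contextual evidence (assumed form): mean over all context
    # vectors from all modalities.
    global_evidence = mean([c for vecs in context.values() for c in vecs])
    # Combine both into a single cue vector (assumed combination).
    return [math.tanh(l + g) for l, g in zip(local_evidence, global_evidence)]

def cue_guided_fusion(cue, utterance):
    """Hypothetical interaction: the cue scores each utterance modality
    and the fused vector is the cue-weighted sum."""
    names = list(utterance)
    weights = softmax([dot(cue, utterance[m]) for m in names])
    fused = weighted_sum(weights, [utterance[m] for m in names])
    return fused, dict(zip(names, weights))

# Toy 3-dimensional features, two context turns per modality.
context = {
    "text": [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]],
    "audio": [[0.0, 1.0, 0.0], [0.0, 0.5, 0.5]],
    "vision": [[0.0, 0.0, 1.0], [0.5, 0.0, 0.5]],
}
utterance = {
    "text": [1.0, 0.2, 0.0],
    "audio": [0.1, 0.9, 0.0],
    "vision": [0.0, 0.1, 0.8],
}

cue = build_cue(context, utterance)
fused, weights = cue_guided_fusion(cue, utterance)
```

In this sketch the fused vector would go to a classifier head; a trained model would replace the fixed operators with learned projections.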