Resolving Sentiment Discrepancy for Multimodal Sentiment Detection via Semantics Completion and Decomposition

๐Ÿ“… 2024-07-09
๐Ÿ›๏ธ Pattern Recognition
๐Ÿ“ˆ Citations: 5
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing single-branch multimodal fusion methods suffer from insufficient modeling capacity and degraded unimodal representations when confronted with cross-modal sentiment discrepancy, i.e., sentiment inconsistency or contradiction between an image and its user-generated text. Method: This paper proposes a semantics Completion and Decomposition (CoDe) network. It bridges the image-text semantic gap via OCR-extracted in-image text (semantics completion) and explicitly separates and models discrepant sentiment through exclusive modality projections coupled with inter-modal contrastive learning (semantics decomposition). Contribution/Results: By modeling discrepant sentiment explicitly rather than implicitly, and combining it with cross-modal cross-attention fusion, CoDe achieves significant improvements over state-of-the-art methods on four benchmark multimodal sentiment datasets.

๐Ÿ“ Abstract
With the proliferation of social media posts in recent years, the need to detect sentiments in multimodal (image-text) content has grown rapidly. Since posts are user-generated, the image and text from the same post can express different or even contradictory sentiments, leading to potential **sentiment discrepancy**. However, existing works mainly adopt a single-branch fusion structure that primarily captures the consistent sentiment between image and text. Ignoring or only implicitly modeling the discrepant sentiment results in compromised unimodal encoding and limited performance. In this paper, we propose a semantics Completion and Decomposition (CoDe) network to resolve the above issue. In the semantics completion module, we complement image and text representations with the semantics of the OCR text embedded in the image, helping bridge the sentiment gap. In the semantics decomposition module, we decompose image and text representations with exclusive projection and contrastive learning, thereby explicitly capturing the discrepant sentiment between modalities. Finally, we fuse image and text representations by cross-attention and combine them with the learned discrepant sentiment for final classification. Extensive experiments conducted on four multimodal sentiment datasets demonstrate the superiority of CoDe against SOTA methods.
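The abstract's semantics decomposition module (exclusive projection plus contrastive learning) is not specified in detail here; the sketch below is a minimal NumPy illustration of the general idea under common assumptions: a symmetric InfoNCE-style contrastive loss pulls paired image/text embeddings together, while an orthogonality penalty pushes each modality's "exclusive" component away from the shared one. Function names and the exact objectives are illustrative, not the paper's.

```python
import numpy as np

def l2norm(x, eps=1e-8):
    """Normalize rows to unit length for cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def infonce(img, txt, tau=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.
    Row i of `img` and row i of `txt` come from the same post (positive pair)."""
    img, txt = l2norm(img), l2norm(txt)
    logits = img @ txt.T / tau                  # (B, B) similarity matrix
    labels = np.arange(len(img))                # positives lie on the diagonal

    def ce(lg):
        # numerically stable cross-entropy against the diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))    # image->text and text->image

def orthogonality_penalty(shared, exclusive):
    """Mean squared cosine between shared and exclusive components;
    driving this toward 0 keeps the exclusive (discrepant) part
    disentangled from the shared (consistent) part."""
    s, e = l2norm(shared), l2norm(exclusive)
    return float(np.mean(np.sum(s * e, axis=-1) ** 2))
```

In this reading, the contrastive term shapes the shared subspace where consistent sentiment lives, and the exclusive projections retain what each modality expresses that the other does not.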
Problem

Research questions and friction points this paper is trying to address.

Resolving sentiment discrepancy between image and text in multimodal posts
Complementing representations with in-image text semantics to bridge gaps
Explicitly capturing discrepant sentiments via decomposition and contrastive learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses in-image text semantics to complement representations
Decomposes representations via exclusive projection and contrastive learning
Fuses modalities with cross-attention and discrepant sentiment
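The fusion step in the last bullet is standard cross-attention; since the paper's exact configuration (heads, dimensions) isn't stated here, the following single-head NumPy sketch only illustrates the mechanism: tokens of one modality form the queries, tokens of the other form the keys and values, and the fused output can then be concatenated with the learned discrepant-sentiment vector for classification. All weight matrices are random stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, key_value_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: one modality's tokens attend
    to the other modality's tokens."""
    Q = query_tokens @ Wq                       # (Nq, d)
    K = key_value_tokens @ Wk                   # (Nk, d)
    V = key_value_tokens @ Wv                   # (Nk, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])     # scaled dot-product
    return softmax(scores, axis=-1) @ V         # (Nq, d) fused features
```

A usage pattern consistent with the summary: run text-queries-over-image-patches (and optionally the reverse), pool the result, and concatenate it with the discrepant-sentiment representation before the final classifier.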