Synthesizing Sentiment-Controlled Feedback For Multimodal Text and Image Data

📅 2024-02-12
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
This study addresses the lack of sentiment-controllable multimodal feedback generation in human-computer interaction (HCI). We propose a controllable multimodal feedback synthesis framework that extracts textual features with a transformer and visual features with a Faster R-CNN, fuses them in an encoder-decoder architecture with a plug-and-play controllability block, and adds a rank-based feedback-relevance assessment module alongside an interpretability technique for analyzing feature contributions. We also construct and publicly release the large-scale CMFeed dataset and an open-source implementation. Experiments show a sentiment classification accuracy of 77.23%, 18.82% higher than the accuracy without controllability, while enabling fine-grained sentiment modulation and quantitative relevance evaluation of the generated feedback. The framework supports empathetic, practically useful HCI applications across education, healthcare, marketing, and customer service.

๐Ÿ“ Abstract
The ability to generate sentiment-controlled feedback in response to multimodal inputs comprising text and images addresses a critical gap in human-computer interaction. This capability allows systems to provide empathetic, accurate, and engaging responses, with useful applications in education, healthcare, marketing, and customer service. To this end, we have constructed a large-scale Controllable Multimodal Feedback Synthesis (CMFeed) dataset and propose a controllable feedback synthesis system. The system features an encoder, decoder, and controllability block for textual and visual inputs. It extracts features using a transformer and Faster R-CNN networks, combining them to generate feedback. The CMFeed dataset includes images, texts, reactions to the posts, human comments with relevance scores, and reactions to these comments. These reactions train the model to produce feedback with specified sentiments, achieving a sentiment classification accuracy of 77.23%, which is 18.82% higher than the accuracy without controllability. The system also incorporates a similarity module for assessing feedback relevance through rank-based metrics and an interpretability technique to analyze the contributions of textual and visual features during feedback generation. Access to the CMFeed dataset and the system's code is available at https://github.com/MIntelligence-Group/CMFeed.
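The similarity module described above assesses feedback relevance through rank-based metrics. A minimal sketch of that idea, assuming cosine similarity between a generated-feedback embedding and human-comment embeddings (the embeddings, dimensions, and scoring function here are illustrative stand-ins, not the paper's exact module):

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative embeddings: five human comments and one generated feedback
# that is deliberately constructed to lie close to comment index 2.
comment_embs = rng.standard_normal((5, 64))
generated_emb = comment_embs[2] + 0.1 * rng.standard_normal(64)

# Score the generated feedback against each human comment, then rank.
sims = [cosine(generated_emb, c) for c in comment_embs]
order = np.argsort(sims)[::-1].tolist()        # indices, most similar first
rank = order.index(2) + 1                      # rank of the matching comment

assert rank == 1  # the nearest human comment ranks first
```

A rank-based score like this rewards feedback that lands near high-quality human comments without requiring an exact textual match.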
Problem

Research questions and friction points this paper is trying to address.

Generating sentiment-controlled feedback for multimodal text-image inputs
Addressing the gap in empathetic human-computer interaction systems
Creating controllable feedback synthesis with high sentiment accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer and Faster R-CNN extract multimodal features
Encoder-decoder with controllability block manages sentiment
CMFeed dataset trains model for sentiment-controlled feedback
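The pipeline the bullets above describe (modality-specific feature extraction, fusion, and a sentiment-controllability signal fed to the decoder) can be sketched as follows. This is a toy numpy sketch under assumed dimensions; the projection matrices, sentiment embeddings, and the additive conditioning scheme are illustrative, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper)
TEXT_DIM, VIS_DIM, HID_DIM, N_SENTIMENTS = 768, 1024, 512, 2

# Stand-ins for features a transformer / Faster R-CNN would produce
text_feat = rng.standard_normal(TEXT_DIM)
vis_feat = rng.standard_normal(VIS_DIM)

# Learned projections (random here) map both modalities to a shared space
W_text = rng.standard_normal((HID_DIM, TEXT_DIM)) * 0.01
W_vis = rng.standard_normal((HID_DIM, VIS_DIM)) * 0.01

# Controllability: one learned embedding per target sentiment,
# added to the fused representation before decoding
sentiment_emb = rng.standard_normal((N_SENTIMENTS, HID_DIM)) * 0.01

def fuse(text_feat, vis_feat, sentiment_id):
    """Fuse both modalities and condition on the requested sentiment."""
    fused = W_text @ text_feat + W_vis @ vis_feat
    return fused + sentiment_emb[sentiment_id]

pos = fuse(text_feat, vis_feat, sentiment_id=0)
neg = fuse(text_feat, vis_feat, sentiment_id=1)
assert pos.shape == (HID_DIM,)
assert not np.allclose(pos, neg)  # the sentiment signal changes the decoder input
```

The key design point is that the sentiment signal enters as a separate, swappable input, which is what makes the generated feedback controllable at inference time.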