🤖 AI Summary
Existing Multimodal Conversational Stance Detection (MCSD) research suffers from two critical limitations: pseudo-multimodality, where only source posts contain images while comments are treated as text-only, and user homogeneity, where individual differences among users are ignored. To address these issues, we introduce U-MStance, the first user-centric MCSD dataset, containing over 40k annotated comments across six real-world targets. We further propose PRISM, a Persona-Reasoned multimodal Stance Model featuring: (1) longitudinal user personas derived from historical posts and comments to capture individual traits; (2) Chain-of-Thought alignment of textual and visual cues within the conversational context, enabling fine-grained cross-modal inference; and (3) mutual task reinforcement that jointly optimizes stance detection and stance-aware response generation for bidirectional knowledge transfer. Extensive experiments demonstrate that PRISM significantly outperforms strong baselines on U-MStance, validating the effectiveness of user-aware representation learning and context-grounded multimodal reasoning.
📝 Abstract
The rapid proliferation of multimodal social media content has driven research in Multimodal Conversational Stance Detection (MCSD), which aims to interpret users' attitudes toward specific targets within complex discussions. However, existing studies remain limited by: **1) pseudo-multimodality**, where visual cues appear only in source posts while comments are treated as text-only, misaligning with real-world multimodal interactions; and **2) user homogeneity**, where diverse users are treated uniformly, neglecting personal traits that shape stance expression. To address these issues, we introduce **U-MStance**, the first user-centric MCSD dataset, containing over 40k annotated comments across six real-world targets. We further propose **PRISM**, a **P**ersona-**R**easoned mult**I**modal **S**tance **M**odel for MCSD. PRISM first derives longitudinal user personas from historical posts and comments to capture individual traits, then aligns textual and visual cues within conversational context via Chain-of-Thought to bridge semantic and pragmatic gaps across modalities. Finally, a mutual task reinforcement mechanism is employed to jointly optimize stance detection and stance-aware response generation for bidirectional knowledge transfer. Experiments on U-MStance demonstrate that PRISM yields significant gains over strong baselines, underscoring the effectiveness of user-centric and context-grounded multimodal reasoning for realistic stance understanding.
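The abstract describes the mutual task reinforcement mechanism only at a high level. A minimal sketch of the generic idea, jointly optimizing a stance-classification loss and a response-generation loss in one objective, might look as follows; the label set, the weighting term `lambda_gen`, and the toy probabilities are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of joint stance-detection + response-generation training:
# a weighted sum of a classification loss and a generation loss, so that
# gradients from each task can inform the shared representation.
# STANCES, lambda_gen, and all numbers below are assumed for illustration.
import math

STANCES = ["favor", "against", "neutral"]  # assumed label set

def stance_loss(probs, gold):
    """Cross-entropy over the stance label distribution."""
    return -math.log(probs[STANCES.index(gold)])

def generation_loss(token_probs):
    """Mean negative log-likelihood of the gold response tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def joint_loss(probs, gold, token_probs, lambda_gen=0.5):
    """Weighted sum enabling bidirectional knowledge transfer."""
    return stance_loss(probs, gold) + lambda_gen * generation_loss(token_probs)

# Toy example: a confident "favor" prediction and a likely gold response.
probs = [0.7, 0.2, 0.1]          # model's stance distribution
token_probs = [0.9, 0.8, 0.95]   # per-token likelihood of the gold response
loss = joint_loss(probs, "favor", token_probs)
```

In practice both losses would be computed from a shared encoder, so improving stance prediction sharpens the generated responses and vice versa; the scalar form here only illustrates how the two objectives combine.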