🤖 AI Summary
Existing post quality assessment methods suffer from three key limitations: (1) unimodal modeling overlooks complementary multimodal cues; (2) deep multimodal fusion introduces modality-specific noise; and (3) existing models fail to capture complex semantic relationships, such as relevance and comprehensiveness, at fine-grained levels. To address these, we propose a fine-grained multimodal reasoning framework that reformulates the task as a multimodal ranking problem for the first time. Our approach introduces a maximum-information fusion mechanism guided by the information bottleneck principle to suppress modality noise, and integrates dual modules: (i) local–global attention for contextualized feature aggregation, and (ii) macro–micro evidence reasoning to emulate human cognitive processes for nuanced quality discrimination. The model is optimized end-to-end using ranking-aware objectives (e.g., NDCG). Extensive experiments on four benchmarks demonstrate significant improvements over state-of-the-art methods, including a gain of up to 9.52% in NDCG@3 over the best unimodal method on the Art History dataset, supporting both effectiveness and generalizability.
📝 Abstract
Accurately assessing post quality requires complex relational reasoning to capture nuanced topic-post relationships. However, existing studies face three major limitations: (1) treating the task as unimodal categorization, which fails to leverage multimodal cues and fine-grained quality distinctions; (2) introducing noise during deep multimodal fusion, leading to misleading signals; and (3) lacking the ability to capture complex semantic relationships like relevance and comprehensiveness. To address these issues, we propose the Multimodal Fine-grained Topic-post Relational Reasoning (MFTRR) framework, which mimics human cognitive processes. MFTRR reframes post-quality assessment as a ranking task and incorporates multimodal data to better capture quality variations. It consists of two key modules: (1) the Local-Global Semantic Correlation Reasoning Module, which models fine-grained semantic interactions between posts and topics at both local and global levels, enhanced by a maximum information fusion mechanism to suppress noise; and (2) the Multi-Level Evidential Relational Reasoning Module, which explores macro- and micro-level relational cues to strengthen evidence-based reasoning. We evaluate MFTRR on three newly constructed multimodal topic-post datasets and the public Lazada-Home dataset. Experimental results demonstrate that MFTRR significantly outperforms state-of-the-art baselines, achieving up to 9.52% NDCG@3 improvement over the best unimodal method on the Art History dataset.
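Since the headline result is reported in NDCG@3, a minimal sketch of how NDCG@k is typically computed for the posts under a single topic may help in reading that number. This is not the authors' evaluation code: the function names, the graded quality labels, and the exponential gain `2^rel - 1` are assumptions made for illustration, and the abstract does not state which gain variant the paper uses.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k items of a ranked list.

    Assumes the common exponential gain (2^rel - 1); a linear gain is
    another widely used variant.
    """
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, so discount is log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(scores: Sequence[float], labels: Sequence[float], k: int) -> float:
    """NDCG@k for one topic: rank posts by predicted score, compare to the ideal order."""
    # Order post indices by predicted quality score, highest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_labels = [labels[i] for i in order]

    # The ideal ranking sorts posts by their true quality labels.
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant posts under this topic
    return dcg_at_k(ranked_labels, k) / ideal_dcg


# Hypothetical example: three posts under one topic with graded quality labels.
perfect = ndcg_at_k(scores=[0.9, 0.5, 0.1], labels=[2, 1, 0], k=3)
reversed_order = ndcg_at_k(scores=[0.1, 0.5, 0.9], labels=[2, 1, 0], k=3)
```

A model that orders posts exactly by their true quality scores 1.0, and any misordering near the top of the list lowers the value. A dataset-level figure would usually be the mean of this quantity over all topics.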