🤖 AI Summary
Existing classical Chinese poetry sentiment analysis predominantly relies on textual semantics, neglecting prosodic (recitation audio) and visual (accompanying paintings) modalities. This paper proposes the first multimodal framework for classical poetry sentiment understanding that jointly models “sound,” “form,” and “meaning.” It introduces, for the first time, multi-dialect recitation audio to capture historical phonological affective cues; integrates generative Chinese painting–style visual representations with CLIP-style cross-modal encoding; and enhances classical Chinese textual representation via large language model–augmented translation. A Multimodal Contrastive Representation Learning (MMCLR) strategy is designed to enable synergistic perception across the audio, visual, and textual modalities. Evaluated on two public benchmarks, the method achieves an absolute accuracy gain of at least 2.51% and a macro-F1 improvement of at least 1.63%. The code is publicly released, establishing a new computational-humanities paradigm for classical poetry sentiment analysis.
📝 Abstract
Classical Chinese poetry is a vital and enduring part of Chinese literature, conveying profound emotional resonance. Existing studies analyze sentiment based on textual meanings, overlooking the unique rhythmic and visual features inherent in poetry, especially since it is often recited and accompanied by Chinese paintings. In this work, we propose a dialect-enhanced multimodal framework for classical Chinese poetry sentiment analysis. We extract sentence-level audio features from the poetry and incorporate audio from multiple dialects, which may retain regional ancient Chinese phonetic features, enriching the phonetic representation. Additionally, we generate sentence-level visual features, and the multimodal features are fused with textual features enhanced by LLM translation through multimodal contrastive representation learning. Our framework outperforms state-of-the-art methods on two public datasets, achieving at least a 2.51% improvement in accuracy and 1.63% in macro F1. We open-source the code to facilitate research in this area and to provide insights for general multimodal Chinese representation.
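The CLIP-style multimodal contrastive objective mentioned above can be sketched as pairwise symmetric InfoNCE losses between aligned text, audio, and visual embeddings. This is a minimal NumPy illustration, not the paper's actual implementation: the function names, the three-way pairing scheme, and the temperature value are assumptions for exposition.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere, as in CLIP-style training."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two aligned batches of embeddings.

    Row i of `a` and row i of `b` are treated as a positive pair (e.g. a
    poem line's text embedding and its recitation-audio embedding); all
    other rows in the batch serve as negatives.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature            # (N, N) cosine-similarity matrix
    idx = np.arange(len(a))                   # positives lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average the two retrieval directions (a -> b and b -> a)
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def multimodal_contrastive_loss(text, audio, visual, temperature=0.07):
    """Average the pairwise contrastive losses over the three modalities
    (one hypothetical way to extend two-modality CLIP to three)."""
    return (info_nce(text, audio, temperature)
            + info_nce(text, visual, temperature)
            + info_nce(audio, visual, temperature)) / 3.0

# Toy usage: random 8-sample batches of 32-d embeddings per modality.
rng = np.random.default_rng(0)
text, audio, visual = (rng.standard_normal((8, 32)) for _ in range(3))
loss = multimodal_contrastive_loss(text, audio, visual)
```

Minimizing such a loss pulls the three embeddings of the same poem line together while pushing apart embeddings of different lines, which is the "synergistic perception" role MMCLR plays in the fusion step.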