Improving Multilingual Social Media Insights: Aspect-based Comment Analysis

πŸ“… 2025-05-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Social media comments exhibit linguistic informality, high noise levels, and heterogeneous viewpoints, severely hindering cross-lingual opinion understanding tasks such as clustering and summarization. To address this, we propose a fine-grained Comment Aspect Term Generation (CAT-G) framework. Our method comprises three key components: (i) constructing the first multilingual CAT-G benchmark covering English, Chinese, Malay, and Indonesian; (ii) designing an aspect-driven attention-guided mechanism tailored for noisy text modeling; and (iii) jointly applying supervised fine-tuning and Direct Preference Optimization (DPO) to align model outputs with human cognitive preferences. Experiments across all four languages demonstrate substantial improvements in cross-lingual comment clustering and summarization quality. Moreover, our framework enables systematic evaluation of multilingual large language models’ capabilities, establishing a novel paradigm and publicly available benchmark resource for multilingual social media understanding.

πŸ“ Abstract
The inherent nature of social media posts, characterized by the freedom of language use and a disjointed array of diverse opinions and topics, poses significant challenges to downstream NLP tasks such as comment clustering, comment summarization, and social media opinion analysis. To address this, we propose a granular approach: identifying and generating aspect terms from individual comments to guide model attention. Specifically, we leverage multilingual large language models with supervised fine-tuning for comment aspect term generation (CAT-G), further aligning the model's predictions with human expectations through DPO. We demonstrate the effectiveness of our method in enhancing the comprehension of social media discourse on two NLP tasks. Moreover, this paper contributes the first multilingual CAT-G test set covering English, Chinese, Malay, and Indonesian (Bahasa Indonesia). As LLM capabilities vary among languages, this test set allows for a comparative analysis of performance across languages with varying levels of LLM proficiency.
Problem

Research questions and friction points this paper is trying to address.

Challenges in analyzing multilingual social media comments due to diverse language use
Need for aspect term generation to improve comment clustering and summarization
Lack of multilingual test sets for evaluating comment aspect term generation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual LLMs with supervised fine-tuning
Aspect term generation from individual comments to guide model attention
DPO alignment for human-like predictions
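To make the alignment step concrete, the sketch below shows the standard Direct Preference Optimization (DPO) objective for a single preference pair. This is a generic formulation of DPO, not the paper's implementation; the function name, inputs, and the toy log-probability values are illustrative assumptions.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are total log-probabilities of the human-preferred (chosen)
    and dispreferred (rejected) aspect-term outputs under the policy
    being trained and under the frozen SFT reference model.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen output over the rejected one, relative to the reference.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the scaled margin; minimizing it pushes
    # the policy toward the human-preferred aspect terms.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy example: the policy already slightly prefers the chosen output,
# so the loss falls below ln(2) (the zero-margin value).
loss = dpo_loss(-12.0, -15.0, -13.0, -14.5, beta=0.1)
```

Supervised fine-tuning first teaches the model the CAT-G task format; DPO then refines it using only relative human preferences between candidate outputs, without training an explicit reward model.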
πŸ”Ž Similar Papers
No similar papers found.
Longyin Zhang
Institute for Infocomm Research, A*STAR, Singapore
Bowei Zou
Institute for Infocomm Research, A*STAR, Singapore
AiTi Aw