🤖 AI Summary
This study addresses the significant performance degradation of NLP models on social media text, particularly for Asian languages, caused by informal expressions, high linguistic variation, and multilingual dialectal diversity, and compounded by the absence of a standardized benchmark for lexical normalization. To bridge this gap, the authors present the first multilingual lexical normalization benchmark covering five Asian languages across four writing systems. They propose a generative sequence-to-sequence architecture based on large language models (LLMs), augmented with multilingual preprocessing strategies, to tackle the challenges of low-resource settings and high lexical variability. Experimental results demonstrate that the proposed approach substantially outperforms existing state-of-the-art models on the newly included languages, and an error analysis provides actionable insights for future improvements.
📝 Abstract
Social media data has been of interest to Natural Language Processing (NLP) practitioners for over a decade, both because of its richness in information and because of the challenges it poses for automatic processing. Since language use on social media is more informal, more spontaneous, and spread across many different sociolects, the performance of NLP models often deteriorates. One solution to this problem is to transform the data into a standard variant before processing it, a task known as lexical normalization. A wide variety of benchmarks and models have been proposed for this task. The MultiLexNorm benchmark was proposed to unify these efforts, but it consists almost solely of languages from the Indo-European language family written in the Latin script. Hence, we propose an extension to MultiLexNorm, which covers five Asian languages from different language families in four different scripts. We show that the previous state-of-the-art model performs worse on the new languages and propose a new architecture based on Large Language Models (LLMs), which shows more robust performance. Finally, we analyze remaining errors, revealing future directions for this task.