🤖 AI Summary
This study addresses the challenge that “abstract language” (Chouxiang Language), a highly dynamic and context-dependent form of Chinese internet subcultural discourse, poses to the comprehension and generation capabilities of large language models (LLMs). To systematically evaluate mainstream LLMs, the authors introduce Mouse, the first multi-task benchmark specifically designed for Chinese abstract language, encompassing six understanding and generation tasks. Evaluation combines LLM-as-a-judge scoring, human assessment, and error attribution analysis. Results reveal that current state-of-the-art models perform adequately only on contextual semantic understanding and significantly underperform on the other tasks. The study also uncovers a notable misalignment between model judgments and human values. The code and dataset are publicly released to foster further research in this domain.
📝 Abstract
While large language models (LLMs) have achieved remarkable success in general language tasks, their performance on Chouxiang Language, a representative subcultural language in the Chinese internet context, remains largely unexplored. In this paper, we introduce Mouse, a specialized benchmark designed to evaluate LLMs on six NLP tasks involving Chouxiang Language. Experimental results show that current state-of-the-art (SOTA) LLMs exhibit clear limitations on multiple tasks, while performing well on tasks that involve contextual semantic understanding. We also discuss the reasons behind the generally low performance of SOTA LLMs on Chouxiang Language, examine whether the LLM-as-a-judge approach adopted for translation tasks aligns with human judgments and values, and analyze the key factors that influence Chouxiang translation. Our study aims to promote further research in the NLP community on multicultural integration and the dynamics of evolving internet languages. Our code and data are publicly available.