🤖 AI Summary
Existing benchmarks inadequately assess the fine-grained lexical understanding of large language models (LLMs), particularly at the word, character, and sub-character (component) levels, across languages with diverse scripts and writing systems (e.g., CJK logographs, kana, hangul, and more than ten other scripts).
Method: We introduce EXECUTE, the first cross-script, token-level lexical understanding benchmark, extending CUTE to 12 languages, and propose a lightweight, scalable evaluation framework. The methodology combines script-aware tokenization, sub-character decomposition, cross-model consistency analysis, and controlled synthetic tasks for precise token-level diagnostics.
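To make the idea of a controlled synthetic task concrete, here is a minimal sketch of how a character-level probe could be generated for any script; the swap task, the prompt wording, and the `build_char_swap_item` helper are illustrative assumptions, not the benchmark's actual implementation.

```python
import random

def build_char_swap_item(word: str, rng: random.Random) -> dict:
    """Build one character-level test item: ask for two characters of a word
    to be swapped and record the expected answer. Operates on Unicode code
    points, so the same generator applies to any script."""
    chars = list(word)
    i, j = sorted(rng.sample(range(len(chars)), 2))
    swapped = chars[:]
    swapped[i], swapped[j] = swapped[j], swapped[i]
    prompt = (
        f'Swap the characters at positions {i + 1} and {j + 1} in the word '
        f'"{word}". Answer with the resulting word only.'
    )
    return {"prompt": prompt, "target": "".join(swapped)}

rng = random.Random(0)
for w in ["understanding", "anlayış", "이해하다"]:  # English, Turkish, Korean examples
    item = build_char_swap_item(w, rng)
    print(item["prompt"], "->", item["target"])
```

Because the generator works on code points rather than tokens, the same item template can be instantiated in every language, which is what makes cross-script comparison of error rates meaningful.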
Contribution/Results: We uncover language-specific understanding bottlenecks: English exhibits pronounced character-level deficits, Turkish shows robust performance, and CJK languages reveal severe sub-character (component-level) weaknesses, with error rates up to 47%. Evaluation across 12 state-of-the-art LLMs validates the framework's efficacy and reveals fundamental disparities in subword representation fidelity across scripts.
📝 Abstract
The CUTE benchmark showed that LLMs struggle with character understanding in English. We extend it to more languages with diverse scripts and writing systems, introducing EXECUTE. Our simplified framework allows easy expansion to any language. Tests across multiple LLMs reveal that challenges in other languages are not always on the character level as in English. Some languages show word-level processing issues, while others show no issues at all. We also examine sub-character tasks in Chinese, Japanese, and Korean to assess LLMs' understanding of character components.
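As an example of the ground truth such sub-character tasks can be checked against, the sketch below decomposes precomposed Hangul syllables into their jamo components using standard Unicode arithmetic; the function name and the "list the components" framing are assumptions for illustration, and for Chinese and Japanese a dictionary of character components would play the analogous role.

```python
import unicodedata

# Standard Unicode jamo inventories for precomposed Hangul syllables (U+AC00..U+D7A3).
CHOSEONG = [chr(0x1100 + i) for i in range(19)]          # 19 leading consonants
JUNGSEONG = [chr(0x1161 + i) for i in range(21)]         # 21 vowels
JONGSEONG = [""] + [chr(0x11A8 + i) for i in range(27)]  # 27 trailing consonants (or none)

def decompose_hangul(syllable: str) -> list[str]:
    """Split one precomposed Hangul syllable into its jamo components."""
    code = ord(syllable)
    if not 0xAC00 <= code <= 0xD7A3:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    index = code - 0xAC00
    lead, vowel, tail = index // 588, (index % 588) // 28, index % 28
    return [CHOSEONG[lead], JUNGSEONG[vowel]] + ([JONGSEONG[tail]] if tail else [])

# Reference answer for a hypothetical "list the components of this syllable" item:
print(decompose_hangul("한"))  # ['ᄒ', 'ᅡ', 'ᆫ']
print([unicodedata.name(j) for j in decompose_hangul("글")])
```

Running this on 한 yields the three jamo ᄒ, ᅡ, ᆫ, which can serve as the gold answer when a model is asked to name a syllable's components.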