EXECUTE: A Multilingual Benchmark for LLM Token Understanding

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess large language models’ (LLMs) fine-grained lexical understanding—particularly at character, word, and subcharacter (grapheme) levels—across multilingual, multiscript systems (e.g., CJK logographs, kana, hangul, and >10 other scripts). Method: We introduce the first cross-script, token-level lexical understanding benchmark, extending CUTE to 12 languages, and propose a lightweight, scalable evaluation framework. Our methodology integrates script-aware tokenization, subcharacter decomposition, cross-model consistency analysis, and controlled synthetic tasks for precise token-level diagnostics. Contribution/Results: We uncover language-specific understanding bottlenecks: while English exhibits pronounced character-level deficits, Turkish shows robust performance, and CJK languages reveal severe grapheme-level weaknesses (error rates up to 47%). Evaluation across 12 state-of-the-art LLMs validates the framework’s efficacy and reveals fundamental disparities in subword representation fidelity across scripts.

📝 Abstract
The CUTE benchmark showed that LLMs struggle with character-level understanding in English. We extend it to more languages with diverse scripts and writing systems, introducing EXECUTE. Our simplified framework allows easy expansion to any language. Tests across multiple LLMs reveal that challenges in other languages do not always lie at the character level, as they do in English: some languages show word-level processing issues, while others show no issues at all. We also examine sub-character tasks in Chinese, Japanese, and Korean to assess LLMs' understanding of character components.
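The kind of character-level probe the abstract describes can be sketched as follows. This is a hypothetical illustration of CUTE-style tasks (spelling, counting, substitution), not the paper's actual task set; the prompt wording and the example words are assumptions.

```python
def spelling_prompt(word: str) -> str:
    """Prompt asking the model to spell a word character by character."""
    return f"Spell the word '{word}' with one character per line."

def count_prompt(word: str, char: str) -> str:
    """Prompt asking the model to count occurrences of a character."""
    return f"How many times does '{char}' appear in '{word}'?"

def substitute(word: str, old: str, new: str) -> str:
    """Reference answer for a character-substitution task."""
    return word.replace(old, new)

# A Turkish example word, used only for illustration.
print(spelling_prompt("kitap"))
print(substitute("kitap", "a", "e"))  # -> "kitep"
```

The same task templates apply unchanged to any script, which is what makes this style of benchmark easy to extend to new languages.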
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' character understanding in diverse languages
Identifying word-level processing issues in non-English languages
Evaluating sub-character task performance in CJK scripts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends CUTE benchmark to diverse multilingual scripts
Simplified framework for easy language expansion
Tests sub-character tasks in CJK languages
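For Korean, one concrete form of sub-character structure is the decomposition of a hangul syllable into its jamo components, which Unicode NFD normalization performs directly. This is an illustration of the kind of decomposition such tasks rely on, not necessarily the paper's exact method.

```python
import unicodedata

def jamo(syllable: str) -> list[str]:
    """Decompose a precomposed hangul syllable into its jamo."""
    return list(unicodedata.normalize("NFD", syllable))

# '한' (U+D55C) splits into initial consonant, vowel, final consonant.
print(jamo("한"))  # ['ᄒ', 'ᅡ', 'ᆫ']
```

Chinese and Japanese characters have no comparable Unicode-level decomposition, so grapheme-component tasks there require an external radical/component database.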