🤖 AI Summary
Large language models (LLMs) exhibit significant limitations in fine-grained, token-level understanding and structured reasoning, hindering their deployment in precision-critical control tasks. To address this, we propose TASE, a multilingual (Chinese/English/Korean) benchmark explicitly designed for token-aware perception and structural understanding, comprising a 35,927-instance evaluation set across tasks including character counting, cross-lingual token alignment, and syntactic parsing. We introduce a scalable synthetic data generation pipeline for training and a structured prompt-based evaluation framework, and additionally train a Qwen2.5-14B model with GRPO. Comprehensive evaluation of over 30 state-of-the-art models reveals substantial performance gaps relative to human baselines, exposing pervasive deficiencies in modeling underlying linguistic structure. TASE is publicly released to serve as a standardized diagnostic and development tool for advancing multilingual fine-grained capabilities.
📝 Abstract
While large language models (LLMs) have demonstrated remarkable performance on high-level semantic tasks, they often struggle with fine-grained, token-level understanding and structural reasoning, capabilities that are essential for applications requiring precision and control. We introduce TASE, a comprehensive benchmark designed to evaluate LLMs' ability to perceive and reason about token-level information across languages. TASE covers 10 tasks under two core categories, token awareness and structural understanding, spanning Chinese, English, and Korean, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. Tasks include character counting, token alignment, syntactic structure parsing, and length constraint satisfaction. We evaluate over 30 leading commercial and open-source LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1, and train a custom Qwen2.5-14B model using the GRPO training method. Results show that human performance significantly outpaces current LLMs, revealing persistent weaknesses in token-level reasoning. TASE sheds light on these limitations and provides a new diagnostic lens for future improvements in low-level language understanding and cross-lingual generalization. Our code and dataset are publicly available at https://github.com/cyzcz/Tase.
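To make the task categories concrete, here is a minimal sketch of the kind of token-awareness checks the abstract names (character counting and length constraint satisfaction). This is an illustration only, not code from the TASE release; the function names and task format are hypothetical:

```python
# Hypothetical ground-truth checkers for two TASE-style tasks.
# Such checks are trivial programmatically, but subword tokenization
# makes them surprisingly hard for LLMs to answer reliably.

def count_char(text: str, char: str) -> int:
    """Character counting: how many times does `char` occur in `text`?"""
    return text.count(char)

def satisfies_length_constraint(text: str, max_chars: int) -> bool:
    """Length constraint satisfaction: is the response within the limit?"""
    return len(text) <= max_chars

# A model sees "strawberry" as a few subword tokens, not ten characters,
# so counting the three occurrences of "r" requires token-level reasoning.
print(count_char("strawberry", "r"))                     # 3
print(satisfies_length_constraint("hello world", 15))    # True
```

Scoring a model is then a matter of comparing its free-text answer against such a programmatic ground truth, which is what makes these tasks cheap to generate synthetically at scale.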