🤖 AI Summary
Large language models (LLMs) exhibit significant limitations in fine-grained, token-level understanding and structured reasoning, hindering their deployment in precision-critical control tasks. To address this, we propose TASE, a multilingual (Chinese/English/Korean) benchmark explicitly designed for token-aware perception and structural understanding, comprising a 35,927-instance evaluation set across tasks including character counting, cross-lingual token alignment, and syntactic parsing. We introduce a scalable synthetic data generation pipeline for training and a structured prompt-based evaluation framework, and additionally train a Qwen2.5-14B model with GRPO. Comprehensive evaluation of over 30 state-of-the-art models reveals substantial performance gaps relative to human baselines, exposing pervasive deficiencies in modeling underlying linguistic structure. TASE is publicly released to serve as a standardized diagnostic and development tool for advancing multilingual fine-grained capabilities.
📝 Abstract
While large language models (LLMs) have demonstrated remarkable performance on high-level semantic tasks, they often struggle with fine-grained, token-level understanding and structural reasoning, capabilities that are essential for applications requiring precision and control. We introduce TASE, a comprehensive benchmark designed to evaluate LLMs' ability to perceive and reason about token-level information across languages. TASE covers 10 tasks under two core categories, token awareness and structural understanding, spanning Chinese, English, and Korean, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. Tasks include character counting, token alignment, syntactic structure parsing, and length constraint satisfaction. We evaluate over 30 leading commercial and open-source LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1, and train a custom Qwen2.5-14B model using the GRPO training method. Results show that human performance significantly outpaces current LLMs, revealing persistent weaknesses in token-level reasoning. TASE sheds light on these limitations and provides a new diagnostic lens for future improvements in low-level language understanding and cross-lingual generalization. Our code and dataset are publicly available at https://github.com/cyzcz/Tase.
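To make the task categories concrete, here is a minimal sketch of the kind of token-awareness checks the abstract names (character counting and length constraint satisfaction). This is an illustration only, not code from the TASE release; the function names and task format are hypothetical:

```python
# Hypothetical ground-truth checkers for two TASE-style tasks.
# Such checks are trivial programmatically, but subword tokenization
# makes them surprisingly hard for LLMs to answer reliably.

def count_char(text: str, char: str) -> int:
    """Character counting: how many times does `char` occur in `text`?"""
    return text.count(char)

def satisfies_length_constraint(text: str, max_chars: int) -> bool:
    """Length constraint satisfaction: is the response within the limit?"""
    return len(text) <= max_chars

# A model sees "strawberry" as a few subword tokens, not ten characters,
# so counting the three occurrences of "r" requires token-level reasoning.
print(count_char("strawberry", "r"))                     # 3
print(satisfies_length_constraint("hello world", 15))    # True
```

Scoring a model is then a matter of comparing its free-text answer against such a programmatic ground truth, which is what makes these tasks cheap to generate synthetically at scale.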