🤖 AI Summary
Existing subword tokenization methods (e.g., BPE) employed by LLMs obscure the internal character structure of tokens, leaving models with insufficient character-level positional awareness and hindering tasks that require precise character localization, such as Chinese spelling correction (CSC). To address this, we propose Token Internal Position Awareness (TIPA), the first approach to explicitly model intra-token character structure and position without modifying the tokenizer or model architecture. TIPA introduces a reverse character prediction pretraining objective, grounded in the tokenizer's vocabulary, that teaches the model the positional relationships among characters inside each token. On CSC benchmarks, TIPA achieves a +3.2% F1 improvement over strong baselines and significantly enhances character position prediction accuracy. It also generalizes to other character-sensitive downstream tasks, including OCR post-processing and classical Chinese text collation, demonstrating its robustness and broad applicability.
📝 Abstract
Tokenization methods like Byte-Pair Encoding (BPE) enhance computational efficiency in large language models (LLMs) but often obscure internal character structures within tokens. This limitation hinders LLMs' ability to predict precise character positions, which is crucial in tasks like Chinese Spelling Correction (CSC) where identifying the positions of misspelled characters accelerates correction processes. We propose Token Internal Position Awareness (TIPA), a method that significantly improves models' ability to capture character positions within tokens by training them on reverse character prediction tasks using the tokenizer's vocabulary. Experiments demonstrate that TIPA enhances position prediction accuracy in LLMs, enabling more precise identification of target characters in original text. Furthermore, when applied to downstream tasks that do not require exact position prediction, TIPA still boosts performance in tasks needing character-level information, validating its versatility and effectiveness.
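The reverse character prediction task described above can be sketched as follows. This is a minimal illustration of how such training examples might be constructed from a tokenizer's vocabulary, based only on the description in the abstract; the function name, the toy vocabulary, and the exact target format (1-indexed position-character pairs listed in reverse) are assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of TIPA-style training-data construction.
# For each multi-character token in the vocabulary, the target asks the
# model to enumerate the token's characters in reverse order, each tagged
# with its 1-based position inside the token.

def tipa_example(token: str) -> dict:
    """Map a token to its reverse character-position listing.

    E.g. "错误" yields the target [(2, "误"), (1, "错")], so the model
    must recover every character and its position within the token.
    """
    pairs = [(i + 1, ch) for i, ch in enumerate(token)]
    return {"token": token, "target": list(reversed(pairs))}

# Toy vocabulary standing in for a real BPE tokenizer's token list.
toy_vocab = ["错误", "位置", "a", "token"]

# Single-character tokens carry no internal structure, so skip them.
dataset = [tipa_example(t) for t in toy_vocab if len(t) > 1]
for ex in dataset:
    print(ex["token"], ex["target"])
```

Because the targets are derived purely from the vocabulary itself, such data can be generated offline for any tokenizer without changing the tokenizer or the model architecture.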