Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning

📅 2024-11-26
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing subword tokenization methods (e.g., BPE) employed by LLMs obscure the internal character structure of tokens, leaving models with weak character-level positional awareness and hindering tasks that require precise character localization, such as Chinese spelling correction (CSC). To address this, we propose Token Internal Position Awareness (TIPA), the first approach to explicitly disentangle token-internal structure and position without modifying the tokenizer or model architecture. TIPA introduces a reverse character prediction pretraining objective, grounded in the tokenizer's vocabulary, to explicitly model intra-token character position relationships. On CSC benchmarks, TIPA achieves a +3.2% F1 improvement over strong baselines and substantially improves character position prediction accuracy. It also generalizes well to other character-sensitive downstream tasks, including OCR post-processing and classical Chinese text collation, validating its robustness and broad applicability.

📝 Abstract
Tokenization methods like Byte-Pair Encoding (BPE) enhance computational efficiency in large language models (LLMs) but often obscure internal character structures within tokens. This limitation hinders LLMs' ability to predict precise character positions, which is crucial in tasks like Chinese Spelling Correction (CSC) where identifying the positions of misspelled characters accelerates correction processes. We propose Token Internal Position Awareness (TIPA), a method that significantly improves models' ability to capture character positions within tokens by training them on reverse character prediction tasks using the tokenizer's vocabulary. Experiments demonstrate that TIPA enhances position prediction accuracy in LLMs, enabling more precise identification of target characters in original text. Furthermore, when applied to downstream tasks that do not require exact position prediction, TIPA still boosts performance in tasks needing character-level information, validating its versatility and effectiveness.
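The reverse character prediction idea can be illustrated with a minimal sketch: for each token in the tokenizer's vocabulary, build a training example whose target enumerates the token's characters in reverse order together with their positions. The function name, the exact target format, and the toy vocabulary below are our illustrative assumptions, not the paper's actual prompt or data format.

```python
# Sketch of TIPA-style training data construction (illustrative only:
# the target format and vocabulary here are assumptions, not the
# paper's actual setup).

def tipa_example(token: str) -> dict:
    """Build one reverse character prediction example for a token.

    The target lists the token's characters in reverse order, each
    paired with its 1-based position counted from the token's end.
    """
    reversed_pairs = [(i + 1, ch) for i, ch in enumerate(reversed(token))]
    return {"input": token, "target": reversed_pairs}

# Toy vocabulary standing in for a real BPE vocabulary.
vocab = ["ing", "tion", "拼写"]
dataset = [tipa_example(tok) for tok in vocab]
```

Because the examples are derived purely from the tokenizer's vocabulary, no annotated corpus is needed for this auxiliary objective.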
Problem

Research questions and friction points this paper is trying to address.

Improving character position prediction in LLM tokens
Enhancing Chinese spelling correction via character awareness
Boosting performance in character-level downstream tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token Internal Position Awareness (TIPA) method
Trains on reverse character prediction tasks
Improves character position accuracy in tokens
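For a task like CSC, an intra-token position prediction is only useful once it is mapped back to an absolute character offset in the original text. A minimal sketch of that mapping, assuming detokenization is simple concatenation (the function name is ours, not from the paper):

```python
# Map (token index, intra-token character index) to an absolute
# character offset, assuming tokens concatenate to the original text.

def char_position(tokens: list[str], tok_idx: int, char_idx: int) -> int:
    """Absolute 0-based offset of tokens[tok_idx][char_idx] in the
    detokenized string."""
    return sum(len(t) for t in tokens[:tok_idx]) + char_idx

tokens = ["今天", "天气", "很好"]
pos = char_position(tokens, 1, 1)  # second char of second token -> 3
```

With this mapping, a model that knows where a misspelled character sits inside its token can point directly at the character to correct.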
Authors

Zhuo Xu (Wuhan University)
Zhiqiang Zhao (School of Software Engineering, Chongqing University of Posts and Telecommunications)
Zihan Zhang (School of Software Engineering, Chongqing University of Posts and Telecommunications)
Yuchi Liu (Tsinghua University)
Quanwei Shen (School of Software Engineering, Chongqing University of Posts and Telecommunications)
Fei Liu (Baidu AI Platform & Ecosystem)
Yu Kuang (School of Software Engineering, Chongqing University of Posts and Telecommunications)