🤖 AI Summary
Existing subword tokenization methods (e.g., BPE) employed by LLMs obscure the internal character structure of tokens, leaving models with insufficient character-level positional awareness and hindering tasks that require precise character localization, such as Chinese spelling correction (CSC). To address this, we propose Token Internal Position Awareness (TIPA), the first approach to explicitly model intra-token character structure and position without modifying the tokenizer or model architecture. TIPA introduces a reverse character prediction pretraining objective, grounded in the tokenizer's vocabulary, that teaches the model the positional relationships among characters inside each token. On CSC benchmarks, TIPA achieves a +3.2% F1 improvement over strong baselines and significantly enhances character position prediction accuracy. It also generalizes to other character-sensitive downstream tasks, including OCR post-processing and classical Chinese text collation, demonstrating its robustness and broad applicability.
📝 Abstract
Tokenization methods like Byte-Pair Encoding (BPE) enhance computational efficiency in large language models (LLMs) but often obscure internal character structures within tokens. This limitation hinders LLMs' ability to predict precise character positions, which is crucial in tasks like Chinese Spelling Correction (CSC) where identifying the positions of misspelled characters accelerates correction processes. We propose Token Internal Position Awareness (TIPA), a method that significantly improves models' ability to capture character positions within tokens by training them on reverse character prediction tasks using the tokenizer's vocabulary. Experiments demonstrate that TIPA enhances position prediction accuracy in LLMs, enabling more precise identification of target characters in original text. Furthermore, when applied to downstream tasks that do not require exact position prediction, TIPA still boosts performance in tasks needing character-level information, validating its versatility and effectiveness.
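The reverse character prediction task described above can be sketched as follows. This is a minimal illustration of how such training examples might be constructed from a tokenizer's vocabulary, based only on the description in the abstract; the function name, the toy vocabulary, and the exact target format (1-indexed position-character pairs listed in reverse) are assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of TIPA-style training-data construction.
# For each multi-character token in the vocabulary, the target asks the
# model to enumerate the token's characters in reverse order, each tagged
# with its 1-based position inside the token.

def tipa_example(token: str) -> dict:
    """Map a token to its reverse character-position listing.

    E.g. "错误" yields the target [(2, "误"), (1, "错")], so the model
    must recover every character and its position within the token.
    """
    pairs = [(i + 1, ch) for i, ch in enumerate(token)]
    return {"token": token, "target": list(reversed(pairs))}

# Toy vocabulary standing in for a real BPE tokenizer's token list.
toy_vocab = ["错误", "位置", "a", "token"]

# Single-character tokens carry no internal structure, so skip them.
dataset = [tipa_example(t) for t in toy_vocab if len(t) > 1]
for ex in dataset:
    print(ex["token"], ex["target"])
```

Because the targets are derived purely from the vocabulary itself, such data can be generated offline for any tokenizer without changing the tokenizer or the model architecture.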