🤖 AI Summary
Current Chinese spelling correction (CSC) benchmarks focus exclusively on character substitution errors, neglecting the similarly frequent insertion and deletion errors; this leads to incomplete data coverage, biased evaluation, and limited practical applicability. To address this, we propose a new task, Generalized Chinese Character Error Correction (C2EC), the first to systematically cover substitution, insertion, and deletion errors, and introduce a high-quality benchmark dataset for it. Methodologically, we design a training-free, zero-shot LLM-based correction framework: it leverages Levenshtein distance to align character positions when errors change the text length, and employs a two-stage prompting strategy to guide a 14B-parameter LLM toward precise error repair. Experiments show that our approach performs on par with fine-tuned models possessing nearly 50× more parameters, on both standard CSC and the new C2EC task, while significantly improving error coverage, robustness, and generalization across diverse error types.
📝 Abstract
Chinese spelling correction (CSC) is a crucial task that aims to correct character errors in Chinese text. While conventional CSC focuses on character substitution errors caused by mistyping, two other common types of character errors, missing and redundant characters, have received less attention. These errors are often excluded from CSC datasets during annotation, or ignored during evaluation even when they have been annotated, which limits the practicality of the CSC task. To address this issue, we introduce the task of General Chinese Character Error Correction (C2EC), which covers all three types of character errors. We construct a high-quality C2EC benchmark by combining and manually verifying data from the CCTC and Lemon datasets. We extend the training-free, prompt-free CSC method to C2EC by using Levenshtein distance to handle length changes and by leveraging an additional prompt-based large language model (LLM) to improve performance. Experiments show that our method enables a 14B-parameter LLM to perform on par with models nearly 50 times larger on both conventional CSC and C2EC tasks, without any fine-tuning.
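The Levenshtein-based alignment mentioned above can be illustrated with a minimal sketch: a standard edit-distance dynamic program whose backtrace classifies each character-level edit as a substitution, an insertion (missing character), or a deletion (redundant character). This is only an illustration of the alignment idea, not the authors' implementation; the function name `edit_ops` and the tuple format are assumptions.

```python
def edit_ops(src, tgt):
    """Align src to tgt with Levenshtein DP and return the character edits
    as (op, src_position, src_char, tgt_char) tuples. This is a generic
    sketch of the alignment step, not the paper's actual code."""
    m, n = len(src), len(tgt)
    # dp[i][j] = minimal number of edits turning src[:i] into tgt[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete redundant src char
                           dp[i][j - 1] + 1,        # insert missing tgt char
                           dp[i - 1][j - 1] + cost) # keep or substitute
    # Backtrace from the corner to recover which edits the DP chose.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == tgt[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                     # characters already aligned
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            ops.append(("substitute", i - 1, src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("delete", i - 1, src[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", i, "", tgt[j - 1]))
            j -= 1
    return list(reversed(ops))
```

For example, `edit_ops("我喜猫", "我喜欢猫")` reports a single missing-character insertion, `edit_ops("我我喜欢", "我喜欢")` a single redundant-character deletion, and conventional CSC substitutions appear as length-preserving `substitute` edits; this per-position classification is what lets length-changing errors be handled alongside substitutions.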