🤖 AI Summary
Current Chinese spelling correction (CSC) benchmarks focus exclusively on character substitution errors, neglecting the similarly frequent insertion and deletion errors; this leads to incomplete data coverage, biased evaluation, and limited practical applicability. To address this, we propose a new task, Generalized Chinese Character Error Correction (C2EC), the first to systematically cover substitution, insertion, and deletion errors, and introduce a high-quality benchmark dataset for it. Methodologically, we design a training-free, zero-shot LLM-based correction framework: it leverages Levenshtein distance to align character positions when errors change the text length, and employs a two-stage prompting strategy to guide a 14B-parameter LLM toward precise error repair. Experiments show that our approach performs on par with fine-tuned models possessing nearly 50× more parameters, on both standard CSC and the new C2EC task, while significantly improving error coverage, robustness, and generalization across diverse error types.
📝 Abstract
Chinese spelling correction (CSC) is a crucial task that aims to correct character errors in Chinese text. While conventional CSC focuses on character substitution errors caused by mistyping, two other common types of character errors, missing and redundant characters, have received less attention. These errors are often excluded from CSC datasets during annotation, or ignored during evaluation even when they have been annotated, which limits the practicality of the CSC task. To address this issue, we introduce the task of General Chinese Character Error Correction (C2EC), which covers all three types of character errors. We construct a high-quality C2EC benchmark by combining and manually verifying data from the CCTC and Lemon datasets. We extend the training-free, prompt-free CSC method to C2EC by using Levenshtein distance to handle length changes and by leveraging an additional prompt-based large language model (LLM) to improve performance. Experiments show that our method enables a 14B-parameter LLM to perform on par with models nearly 50 times larger on both conventional CSC and C2EC tasks, without any fine-tuning.
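The Levenshtein-based alignment mentioned above can be illustrated with a minimal sketch: a standard edit-distance dynamic program whose backtrace classifies each character-level edit as a substitution, an insertion (missing character), or a deletion (redundant character). This is only an illustration of the alignment idea, not the authors' implementation; the function name `edit_ops` and the tuple format are assumptions.

```python
def edit_ops(src, tgt):
    """Align src to tgt with Levenshtein DP and return the character edits
    as (op, src_position, src_char, tgt_char) tuples. This is a generic
    sketch of the alignment step, not the paper's actual code."""
    m, n = len(src), len(tgt)
    # dp[i][j] = minimal number of edits turning src[:i] into tgt[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete redundant src char
                           dp[i][j - 1] + 1,        # insert missing tgt char
                           dp[i - 1][j - 1] + cost) # keep or substitute
    # Backtrace from the corner to recover which edits the DP chose.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == tgt[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                     # characters already aligned
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            ops.append(("substitute", i - 1, src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("delete", i - 1, src[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", i, "", tgt[j - 1]))
            j -= 1
    return list(reversed(ops))
```

For example, `edit_ops("我喜猫", "我喜欢猫")` reports a single missing-character insertion, `edit_ops("我我喜欢", "我喜欢")` a single redundant-character deletion, and conventional CSC substitutions appear as length-preserving `substitute` edits; this per-position classification is what lets length-changing errors be handled alongside substitutions.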