🤖 AI Summary
This study investigates the feasibility and performance limits of lightweight language models for critical error detection (CED) in English–German machine translation, targeting edge devices and privacy-sensitive applications. Method: We propose a lightweight logit-bias calibration technique and a majority-voting ensemble mechanism, and systematically evaluate models with fewer than 2 billion parameters—including Gemma-3-1B and Qwen-3-0.6B/1.7B—under few-shot calibration, fine-tuning, and prompt-engineering paradigms. Contribution/Results: On the SynCED-EnDe-2025 benchmark, Gemma-3-1B achieves an MCC of 0.77 and F1-ERR of 0.98, with only ~400 ms per-sample latency on a MacBook Pro M4 Pro. Models near 1 billion parameters strike the optimal trade-off between semantic error identification capability and inference efficiency. This work provides an empirical validation of models at roughly 1B parameters for high-accuracy, on-device CED—demonstrating their practical viability for localized, privacy-preserving translation quality assessment.
📝 Abstract
Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can a model be while still detecting meaning-altering translation errors? Focusing on English–German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC = 0.77 with F1-ERR = 0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining ~400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but at higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.
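The logit-bias calibration and majority-voting steps mentioned above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the label tokens (`ERR`/`NOT`), bias values, and per-variant logits are hypothetical, standing in for the model's real output logits over the two decision tokens.

```python
import math
from collections import Counter

def calibrated_label(logit_err, logit_not, bias_err=0.0, bias_not=0.0):
    """Add per-label biases to the raw logits, then pick the more probable label.

    The biases are the calibration knobs: tuned on a small held-out set to
    counteract a model's prior tendency toward one label.
    """
    z_err = logit_err + bias_err
    z_not = logit_not + bias_not
    # Two-way softmax; only the comparison matters, probability shown for clarity.
    p_err = math.exp(z_err) / (math.exp(z_err) + math.exp(z_not))
    return ("ERR" if p_err >= 0.5 else "NOT"), p_err

def majority_vote(labels):
    """Return the most frequent label across prompt variants or sampled runs."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical logits from three prompt variants for one translation pair.
raw_logits = [(2.1, 1.0), (0.4, 0.9), (1.5, 1.2)]
votes = [calibrated_label(le, ln, bias_err=0.3)[0] for le, ln in raw_logits]
print(majority_vote(votes))  # two of three calibrated votes say ERR
```

In practice the two logits would be read from the model's distribution over the label tokens at the decision position, and the ensemble would vote across standardized prompt variants.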