How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation

📅 2025-11-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the feasibility and performance limits of lightweight language models for critical error detection (CED) in English–German machine translation, targeting edge devices and privacy-sensitive applications. Method: We propose a lightweight logit-bias calibration technique and a majority-voting ensemble mechanism, and systematically evaluate models with fewer than 2 billion parameters (including Gemma-3-1B and Qwen-3-0.6B/1.7B) under few-shot prompting and fine-tuning paradigms. Contribution/Results: On the SynCED-EnDe-2025 benchmark, Gemma-3-1B achieves an MCC of 0.77 and F1-ERR of 0.98, with only ~400 ms per-sample latency on a MacBook Pro M4 Pro. Models near 1 billion parameters strike the best trade-off between semantic-error detection capability and inference efficiency. This work provides the first empirical validation of sub-2B-parameter models for high-accuracy, on-device CED, demonstrating their practical viability for localized, privacy-preserving translation quality assessment.

📝 Abstract
Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English–German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but with higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.
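The logit-bias calibration mentioned in the abstract can be illustrated with a minimal sketch: before taking the argmax over the two CED label tokens, a small per-label offset (fit on a handful of dev examples) is added to the raw logits to counteract the model's prior toward one label. The label names `ERR`/`NOT` and the bias value below are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of logit-bias calibration for binary CED.
# Label names ("ERR"/"NOT") and the bias value are assumptions
# for illustration; the paper's exact procedure may differ.
import math

def softmax(logits):
    """Numerically stable softmax over a {label: logit} dict."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}

def calibrated_label(label_logits, bias):
    """Add a per-label bias (fit on a small dev set) before argmax."""
    adjusted = {k: v + bias.get(k, 0.0) for k, v in label_logits.items()}
    return max(adjusted, key=adjusted.get), softmax(adjusted)

# Raw logits slightly favour NOT, but a learned bias toward ERR
# (offsetting the model's under-detection prior) flips the decision.
logits = {"ERR": 1.2, "NOT": 1.5}
label, probs = calibrated_label(logits, bias={"ERR": 0.6})
```

The same mechanism can be applied at the tokenizer level by biasing the logits of the label tokens during constrained decoding, which avoids any free-form generation.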
Problem

Research questions and friction points this paper is trying to address.

Developing compact language models for on-device critical error detection in machine translation
Evaluating sub-2B parameter models for quality-efficiency trade-offs in error detection
Enabling private low-cost error screening with lightweight calibration techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compact sub-2B models for on-device error detection
Lightweight logit-bias calibration and majority voting
Fine-tuned ~1B-parameter models achieving the best quality-efficiency trade-off
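The majority-voting mechanism listed above can be sketched as aggregating several sampled label predictions per input and returning the most frequent one. This is a minimal sketch assuming the ensemble votes over discrete labels; the paper may aggregate differently (e.g., over prompts or sampled decodings).

```python
# Hedged sketch of majority voting over per-sample label predictions.
from collections import Counter

def majority_vote(predictions):
    """Return the most common label; ties resolve to the label
    that first reached the winning count (Counter preserves order)."""
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]

# Three votes from, e.g., repeated sampling or prompt variants.
votes = ["ERR", "NOT", "ERR"]
decision = majority_vote(votes)
```

Using an odd number of votes avoids ties entirely in the binary CED setting.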
M. Chopra
University of Bonn - Department of Computer Science, Germany
Lorenz Sparrenberg
University of Bonn
Sarthak Khanna
University of Bonn - Department of Computer Science, Germany
R. Sifa
Fraunhofer IAIS - Department of Media Engineering, Germany