🤖 AI Summary
This study investigates the feasibility and performance limits of lightweight language models for critical error detection (CED) in English–German machine translation, targeting edge devices and privacy-sensitive applications. Method: We propose a lightweight logit-bias calibration technique and a majority-voting ensemble mechanism, and systematically evaluate models with fewer than 2 billion parameters—including Gemma-3-1B and Qwen-3-0.6B/1.7B—under few-shot calibration, fine-tuning, and prompt-engineering paradigms. Contribution/Results: On the SynCED-EnDe-2025 benchmark, Gemma-3-1B achieves an MCC of 0.77 and F1-ERR of 0.98, with only ~400 ms per-sample latency on a MacBook Pro M4 Pro. Models near 1 billion parameters strike the optimal trade-off between semantic error identification capability and inference efficiency. This work provides an empirical validation of models at roughly 1B parameters for high-accuracy, on-device CED—demonstrating their practical viability for localized, privacy-preserving translation quality assessment.
📝 Abstract
Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can a model be while still detecting meaning-altering translation errors? Focusing on English–German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC = 0.77 with F1-ERR = 0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining ~400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but at higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.
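The logit-bias calibration and majority-voting steps mentioned above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the label tokens (`ERR`/`NOT`), bias values, and per-variant logits are hypothetical, standing in for the model's real output logits over the two decision tokens.

```python
import math
from collections import Counter

def calibrated_label(logit_err, logit_not, bias_err=0.0, bias_not=0.0):
    """Add per-label biases to the raw logits, then pick the more probable label.

    The biases are the calibration knobs: tuned on a small held-out set to
    counteract a model's prior tendency toward one label.
    """
    z_err = logit_err + bias_err
    z_not = logit_not + bias_not
    # Two-way softmax; only the comparison matters, probability shown for clarity.
    p_err = math.exp(z_err) / (math.exp(z_err) + math.exp(z_not))
    return ("ERR" if p_err >= 0.5 else "NOT"), p_err

def majority_vote(labels):
    """Return the most frequent label across prompt variants or sampled runs."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical logits from three prompt variants for one translation pair.
raw_logits = [(2.1, 1.0), (0.4, 0.9), (1.5, 1.2)]
votes = [calibrated_label(le, ln, bias_err=0.3)[0] for le, ln in raw_logits]
print(majority_vote(votes))  # two of three calibrated votes say ERR
```

In practice the two logits would be read from the model's distribution over the label tokens at the decision position, and the ensemble would vote across standardized prompt variants.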