BatchGEMBA: Token-Efficient Machine Translation Evaluation with Batched Prompting and Prompt Compression

📅 2025-03-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high token consumption and inefficiency of single-example prompting in evaluating large language model (LLM) natural language generation, this paper proposes a framework that combines batched prompting with prompt compression to evaluate multiple translation samples jointly. The authors introduce the first batching-aware lightweight prompt compression model, designed to cut token overhead while mitigating the quality degradation that batching induces. The method builds on the GEMBA-MQM evaluation metric and is shown to be robust across multiple LLMs, including GPT-4o. Experiments show a 2-4x reduction in token usage from batching, with an additional 13-15% savings from compression. At batch size 4, GPT-4o with compression retains 90.2% of its single-example evaluation performance, versus 55.4% for the uncompressed baseline. To the authors' knowledge, this is the first work to jointly optimize evaluation efficiency and quality.

📝 Abstract
Recent advancements in Large Language Model (LLM)-based Natural Language Generation evaluation have largely focused on single-example prompting, resulting in significant token overhead and computational inefficiencies. In this work, we introduce BatchGEMBA-MQM, a framework that integrates batched prompting with the GEMBA-MQM metric for machine translation evaluation. Our approach aggregates multiple translation examples into a single prompt, reducing token usage by 2-4 times (depending on the batch size) relative to single-example prompting. Furthermore, we propose a batching-aware prompt compression model that achieves an additional token reduction of 13-15% on average while also showing the ability to help mitigate batching-induced quality degradation. Evaluations across several LLMs (GPT-4o, GPT-4o-mini, Mistral Small, Phi4, and CommandR7B) and varying batch sizes reveal that while batching generally degrades quality (though sometimes not substantially), prompt compression does not degrade quality further and in some cases recovers part of the loss. For instance, GPT-4o retains over 90% of its baseline performance at a batch size of 4 when compression is applied, compared to a 44.6% drop without compression. We plan to release our code and trained models at https://github.com/NL2G/batchgemba to support future research in this domain.
Problem

Research questions and friction points this paper is trying to address.

High token overhead of single-example prompting in machine translation evaluation
Computational inefficiency of evaluating one translation per prompt
Quality degradation induced by batching multiple examples together
Innovation

Methods, ideas, or system contributions that make the work stand out.

Batched prompting cuts token usage by 2-4x.
Batching-aware prompt compression saves a further 13-15% of tokens.
Compression helps preserve evaluation quality at larger batch sizes.
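The batching idea in the bullets above can be sketched in a few lines of Python. The prompt text and helper names below are illustrative, not the authors' actual GEMBA-MQM prompt; the point is only that shared instructions are paid for once per batch rather than once per example, which is where the token savings come from.

```python
# Illustrative (hypothetical) prompt template, not the paper's exact wording.
INSTRUCTIONS = (
    "You are an expert translation evaluator. For each numbered "
    "source/translation pair below, list MQM error spans and their severities."
)

def single_prompts(pairs):
    """One prompt per example: the instructions are repeated every time."""
    return [f"{INSTRUCTIONS}\nSource: {src}\nTranslation: {tgt}"
            for src, tgt in pairs]

def batched_prompt(pairs):
    """One prompt for the whole batch: the instructions appear only once."""
    body = "\n".join(
        f"{i}. Source: {src}\n   Translation: {tgt}"
        for i, (src, tgt) in enumerate(pairs, 1)
    )
    return f"{INSTRUCTIONS}\n{body}"

def rough_tokens(text):
    # Crude whitespace proxy for a real tokenizer, enough to show the trend.
    return len(text.split())

pairs = [
    ("Guten Morgen.", "Good morning."),
    ("Wie geht es dir?", "How are you?"),
    ("Das Wetter ist schoen.", "The weather is nice."),
    ("Bis spaeter.", "See you later."),
]

single_total = sum(rough_tokens(p) for p in single_prompts(pairs))
batch_total = rough_tokens(batched_prompt(pairs))
print(f"single-example tokens: {single_total}, batched tokens: {batch_total}")
```

With batch size 4, the fixed instructions are counted once instead of four times, so the batched prompt is always shorter; the savings grow with batch size, matching the 2-4x trend the paper reports.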