Leveraging Domain Knowledge at Inference Time for LLM Translation: Retrieval versus Generation

📅 2025-03-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large language models (LLMs) adapt poorly to specialized domains such as medicine and law in machine translation (MT). Method: We systematically compare two paradigms for injecting domain knowledge at inference time on professional-domain MT tasks: retrieval-based (few-shot examples or terminology) and generative (LLM-synthesized knowledge). Contribution/Results: We empirically demonstrate that retrieved few-shot examples significantly outperform both terminology injection and LLM-generated knowledge. The popular multi-domain benchmark primarily tests adaptation to a particular writing style rather than to a specific domain. Remarkably, few-shot examples generated by a weaker model (Llama-3-8B) achieve 96% of the zero-shot performance of a stronger model (Llama-3-70B). Retrieval-based few-shot prompting yields an average BLEU improvement of +4.2 points across professional-domain MT tasks, establishing a highly efficient, low-cost paradigm for test-time domain adaptation.

📝 Abstract

While large language models (LLMs) have been increasingly adopted for machine translation (MT), their performance for specialist domains such as medicine and law remains an open challenge. Prior work has shown that LLMs can be domain-adapted at test time by retrieving targeted few-shot demonstrations or terminologies for inclusion in the prompt. Meanwhile, for general-purpose LLM MT, recent studies have found some success in generating similarly useful domain knowledge from an LLM itself, prior to translation. Our work studies domain-adapted MT with LLMs through a careful prompting setup, finding that demonstrations consistently outperform terminology, and retrieval consistently outperforms generation. We find that generating demonstrations with weaker models can close the gap with a larger model's zero-shot performance. Given the effectiveness of demonstrations, we perform detailed analyses to understand their value. We find that domain-specificity is particularly important, and that the popular multi-domain benchmark is testing adaptation to a particular writing style more so than to a specific domain.
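
To make the retrieval-based setup concrete, the following is a minimal sketch of test-time adaptation with retrieved demonstrations: pick the in-domain sentence pairs most similar to the input and place them in the prompt ahead of it. The token-overlap similarity, the prompt template, the function names, and the tiny translation memory are all illustrative assumptions, not the retriever, prompt, or data used in the paper.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, reference translation)


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two sentences (an assumed stand-in similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_demonstrations(query: str, pool: List[Pair], k: int = 2) -> List[Pair]:
    """Return the k pool pairs whose source side is most similar to the query."""
    ranked = sorted(pool, key=lambda pair: jaccard(query, pair[0]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, demos: List[Pair], src_lang: str, tgt_lang: str) -> str:
    """Format the retrieved pairs as few-shot examples followed by the query."""
    lines = [f"Translate from {src_lang} to {tgt_lang}."]
    for src, tgt in demos:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {query}")
    lines.append(f"{tgt_lang}:")  # left open for the model to complete
    return "\n".join(lines)


# Hypothetical in-domain translation memory (illustrative, not from the paper).
POOL: List[Pair] = [
    ("The patient was given 5 mg of the drug daily.",
     "Der Patient erhielt 5 mg des Arzneimittels pro Tag."),
    ("The contract ends on 1 May.",
     "Der Vertrag endet am 1. Mai."),
    ("The patient reported mild headache.",
     "Der Patient berichtete von leichten Kopfschmerzen."),
]

query = "The patient was given 10 mg of the drug."
demos = retrieve_demonstrations(query, POOL, k=2)
prompt = build_prompt(query, demos, "English", "German")
print(prompt)
```

The resulting prompt would then be sent to the translation LLM. Swapping the lexical overlap for an embedding-based retriever, or replacing the demonstrations with retrieved terminology entries, changes only `retrieve_demonstrations` and the template.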

Problem

Research questions and friction points this paper is trying to address.

Improving LLM translation in specialist domains like medicine and law.

Comparing retrieval versus generation for domain-specific knowledge integration.

Analyzing the effectiveness of demonstrations over terminologies in prompts.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval consistently outperforms generation for domain adaptation.

Demonstrations consistently outperform terminology as prompt content.

Domain specificity is crucial for effective demonstrations.
