🤖 AI Summary
This study addresses the underrepresentation of truly low-resource languages in large language model pretraining and the limited efficacy of conventional few-shot in-context learning for machine translation into these languages. While many-shot approaches show promise, they face challenges in example selection and incur high inference costs. The authors systematically investigate many-shot in-context learning for translation from English into ten low-resource languages and propose an efficient prompting strategy that combines BM25-based retrieval, cross-domain data utilization, and length-sorted example ordering. Experiments show that just 50 BM25-retrieved examples match the performance of 250 randomly selected ones, and that 250 retrieved examples nearly attain the effectiveness of 1,000, substantially improving data efficiency and reducing inference overhead. The work thus offers a practical, scalable recipe for low-resource machine translation.
📝 Abstract
In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks from a few examples, making it promising for languages underrepresented in pre-training. Recent work on many-shot ICL suggests that modern LLMs can benefit further from larger numbers of ICL examples, enabled by their long context windows. However, such gains depend on careful example selection, and the inference cost can be prohibitive for low-resource language communities. In this paper, we present an empirical study of many-shot ICL for machine translation from English into ten truly low-resource languages recently added to FLORES+. We analyze the effects of retrieving more informative examples, using out-of-domain data, and ordering examples by length. Our findings show that many-shot ICL becomes more effective as the number of examples increases. More importantly, we show that BM25-based retrieval substantially improves data efficiency: 50 retrieved examples roughly match 250 randomly sampled many-shot examples, while 250 retrieved examples perform similarly to 1,000.
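The retrieval-plus-ordering strategy described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names (`bm25_scores`, `select_examples`), the BM25 parameters (`k1=1.5`, `b=0.75`), and the shortest-first ordering of the selected pairs are all assumptions made for the sake of the example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.
    k1/b are common default parameters, assumed here for illustration."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    df = Counter()  # document frequency of each term
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

def select_examples(source, pool, k=50):
    """Retrieve the k (source, target) pairs from `pool` whose English side
    best matches `source` under BM25, then order them by target length
    (shortest first -- an assumed ordering for this sketch)."""
    query = source.lower().split()
    docs = [src.lower().split() for src, _ in pool]
    scores = bm25_scores(query, docs)
    top = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)[:k]
    chosen = [pool[i] for i in top]
    chosen.sort(key=lambda pair: len(pair[1]))
    return chosen
```

The selected pairs would then be formatted into the prompt ahead of the sentence to translate; in the study's setting, the pool could hold in-domain or out-of-domain parallel data, and `k` corresponds to the example budgets compared in the paper (e.g. 50 retrieved vs. 250 random).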