🤖 AI Summary
Low-resource languages in multilingual open-domain question answering (MLODQA) suffer from high annotation costs and severe data scarcity. To address this, we propose FsModQA: (1) a few-shot multilingual data synthesis paradigm leveraging large language models (LLMs), requiring only five exemplars per language to generate high-quality training data; (2) a cross-lingual prompting strategy that enables zero-shot transfer to unseen languages by exploiting English supervision signals; and (3) joint fine-tuning integrating Wikidata-based self-supervised pretraining with multilingual retrieval. Experiments demonstrate that FsModQA significantly outperforms existing baselines under both few-shot and zero-shot settings, achieving state-of-the-art performance on both cross-lingual and monolingual retrieval tasks.
📝 Abstract
Recent approaches to multilingual open-domain question answering (MLODQA) have achieved promising results given abundant language-specific training data. However, the considerable annotation cost limits the application of these methods for underrepresented languages. We introduce a *few-shot learning* approach to synthesise large-scale multilingual data from large language models (LLMs). Our method begins with large-scale self-supervised pre-training using Wikidata, followed by training on high-quality synthetic multilingual data generated by prompting LLMs with few-shot supervision. The final model, FsModQA, significantly outperforms existing few-shot and supervised baselines on MLODQA and on cross-lingual and monolingual retrieval. We further show that our method can be extended for effective zero-shot adaptation to new languages through a *cross-lingual prompting* strategy using only English-supervised data, making it a general and broadly applicable solution for MLODQA tasks without costly large-scale annotation.
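To make the few-shot synthesis idea concrete, here is a minimal illustrative sketch of how a prompt might be assembled from a handful of exemplars per language, matching the five-shot setting described above. The function name, prompt wording, and placeholder exemplars are all invented for illustration and are not the paper's actual pipeline; the resulting string would be sent to an LLM to generate new QA pairs.

```python
# Illustrative sketch only (assumed prompt format, not the paper's exact one):
# assemble a few-shot prompt asking an LLM to synthesise a new QA pair
# in a target language, grounded in a given passage.

def build_fewshot_prompt(exemplars, passage, language):
    """Build a few-shot QA-synthesis prompt.

    exemplars: list of (question, answer) tuples in the target language
               (five per language in the paper's few-shot setting).
    passage:   a text passage (e.g. from Wikipedia) to ground generation.
    language:  name of the target language.
    """
    lines = [f"Generate a {language} question and answer for the passage."]
    for q, a in exemplars:
        # Each exemplar is shown as a completed Q/A demonstration.
        lines.append(f"Q: {q}\nA: {a}")
    # The final block leaves "Q:" open for the model to complete.
    lines.append(f"Passage: {passage}\nQ:")
    return "\n\n".join(lines)

# Example with five placeholder Swahili exemplars (invented data):
exemplars = [(f"swali {i}?", f"jibu {i}") for i in range(5)]
prompt = build_fewshot_prompt(exemplars, "…passage text…", "Swahili")
```

The same scaffold supports the zero-shot cross-lingual variant in spirit: the exemplars could be English QA pairs while the instruction requests output in an unseen target language.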