🤖 AI Summary
To address the uneven cross-lingual performance of large language models (LLMs) in Chinese, Indonesian, Malay, and Singaporean English (Singlish), this work proposes a collaborative optimization framework for the four languages that integrates continued pre-training and weight merging. Built on Llama-3-8B-Base, the approach combines domain-adaptive continued pre-training with multi-stage parameter merging. Evaluated on benchmarks across all four languages, the resulting model consistently outperforms the original Llama-3-8B, delivering substantial gains in both comprehension and generation, particularly for the lower-resource languages. All model weights are publicly released, providing a reproducible, scalable technical pathway and a foundational resource for multilingual LLM research and deployment.
📝 Abstract
Multilingual large language models (MLLMs) have shown impressive capabilities across a variety of languages. However, their efficacy can differ greatly across language families, especially for languages with limited linguistic resources. This report presents MERaLiON-TextLLM, a series of open-source language models specifically tailored to improve understanding and generation in Chinese, Indonesian, Malay, and Singlish. The initial release is built on Llama-3-8B-Base and refined through a meticulously crafted process of continued pre-training and weight merging. Our approach achieves performance improvements across benchmarks in these languages, exceeding the capabilities of the official Llama-3 models. We provide the model checkpoints as a resource to support further research and development in cross-lingual language understanding.
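The weight-merging step mentioned in the abstract can be illustrated as a linear interpolation of checkpoint parameters ("model souping"). The sketch below is a generic illustration, not the report's exact recipe: the parameter names, the flat-list tensor representation, and the mixing ratio `alpha` are all illustrative assumptions.

```python
# Minimal sketch of linear weight merging between two checkpoints that share
# the same architecture. Real checkpoints hold tensors (e.g. torch state
# dicts); here each "tensor" is a flat list of floats for simplicity.

def merge_weights(base, tuned, alpha=0.5):
    """Return alpha * tuned + (1 - alpha) * base, parameter by parameter."""
    assert base.keys() == tuned.keys(), "checkpoints must share parameter names"
    return {
        name: [alpha * t + (1 - alpha) * b
               for b, t in zip(base[name], tuned[name])]
        for name in base
    }

# Toy example with made-up parameter names and values.
base_ckpt = {"embed.weight": [0.0, 2.0], "lm_head.weight": [4.0]}
tuned_ckpt = {"embed.weight": [2.0, 4.0], "lm_head.weight": [0.0]}

merged = merge_weights(base_ckpt, tuned_ckpt, alpha=0.5)
# merged["embed.weight"] == [1.0, 3.0]; merged["lm_head.weight"] == [2.0]
```

In practice the same interpolation is applied tensor-wise to full model state dicts, and multi-stage merging repeats this step with different donor checkpoints and ratios.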