Bridging the Language Gap: Enhancing Multilingual Prompt-Based Code Generation in LLMs via Zero-Shot Cross-Lingual Transfer

📅 2024-08-19
🏛️ arXiv.org
🤖 AI Summary
This work addresses the significant performance degradation of large language models (LLMs) in code generation under non-English prompts, a key barrier to global inclusivity. We propose a zero-shot cross-lingual transfer method built on a neural projection-based alignment mechanism that maps the output of the LASER cross-lingual encoder directly into the LLM's token embedding space. The method requires no multilingual annotated data, no prompt translation, and no supervised fine-tuning. Evaluated on a human-verified multilingual MBPP benchmark using CodeLlama and CodeGemma architectures, our approach achieves an average 32.7% improvement in pass@1 rate for non-English prompts, substantially outperforming translation-based baselines, data augmentation, and supervised fine-tuning. The core contribution is an efficient, lightweight, unsupervised alignment paradigm that is trained only on English data and enables LLMs to generalize robustly to multilingual prompts.
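For readers unfamiliar with the pass@1 metric mentioned above, the following is a minimal sketch of the commonly used unbiased pass@k estimator. The summary does not say how the paper estimated pass@1, so treating it as this estimator is an assumption; the function name `pass_at_k` and the sample counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: total programs sampled for one problem (assumed, not from the paper)
    c: how many of those n programs passed all unit tests
    k: the k in pass@k
    """
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    # 1 minus the chance that a random k-subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, the fraction of sampled programs that pass, so a benchmark-level pass@1 is simply the mean of that fraction over all problems.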

📝 Abstract
The use of Large Language Models (LLMs) for program code generation has gained substantial attention, but their biases and limitations with non-English prompts challenge global inclusivity. This paper investigates the complexities of multilingual prompt-based code generation. Our evaluations of LLMs, including CODELLAMA and CODEGEMMA, reveal significant disparities in code quality for non-English prompts; we also demonstrate the inadequacy of simple approaches like prompt translation, bootstrapped data augmentation, and fine-tuning. To address this, we propose a zero-shot cross-lingual approach based on a neural projection technique that maps multilingual embeddings from a cross-lingual encoder such as LASER into the LLM's token space. This method requires training only on English data and scales effectively to other languages. Results on a translated and quality-checked MBPP dataset show substantial improvements in code quality. This research promotes a more inclusive code generation landscape by empowering LLMs with multilingual capabilities to support the diverse linguistic spectrum in programming.
Problem

Research questions and friction points this paper is trying to address.

Addressing code quality disparities in non-English prompts for LLMs
Overcoming limitations of simple translation and fine-tuning methods
Enhancing multilingual code generation via zero-shot cross-lingual transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot cross-lingual transfer for code generation
Neural projection maps multilingual embeddings into the LLM's token space
LASER cross-lingual encoder supplies the multilingual embeddings
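
The projection idea listed above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the page does not give the paper's architecture, loss, or dimensions, so this uses a single linear map trained with a mean-squared-error objective on synthetic vectors that stand in for encoder outputs and LLM token-space targets. No real LASER or LLM API is called, and all names are hypothetical.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mse(W, pairs):
    """Mean squared error of the projection W over (input, target) pairs."""
    total = 0.0
    for x, y in pairs:
        pred = matvec(W, x)
        total += sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)
    return total / len(pairs)


def train_projection(pairs, enc_dim, llm_dim, lr=0.05, epochs=200, seed=0):
    """Fit a linear projection with plain SGD on a squared-error loss."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.1, 0.1) for _ in range(enc_dim)]
         for _ in range(llm_dim)]
    for _ in range(epochs):
        for x, y in pairs:
            pred = matvec(W, x)
            for i in range(llm_dim):
                err = pred[i] - y[i]
                for j in range(enc_dim):
                    W[i][j] -= lr * err * x[j]
    return W


# Synthetic stand-ins (assumed): encoder vectors and their token-space targets.
ENC_DIM, LLM_DIM = 4, 3
rng = random.Random(1)
true_map = [[rng.uniform(-1, 1) for _ in range(ENC_DIM)]
            for _ in range(LLM_DIM)]
english_inputs = [[rng.uniform(-1, 1) for _ in range(ENC_DIM)]
                  for _ in range(32)]
english_pairs = [(x, matvec(true_map, x)) for x in english_inputs]

# Train on the "English" pairs only.
W = train_projection(english_pairs, ENC_DIM, LLM_DIM)

# An unseen vector stands in for a non-English prompt embedding.
unseen = [rng.uniform(-1, 1) for _ in range(ENC_DIM)]
projected = matvec(W, unseen)
expected = matvec(true_map, unseen)
```

The sketch relies on the property that makes zero-shot transfer plausible: a cross-lingual encoder places prompts from different languages in one shared space, so a projection fitted on English pairs can be applied unchanged to vectors from other languages.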
Mingda Li
School of Information, University of Texas at Austin
Abhijit Mishra
Assistant Professor of Practice, iSchool, University of Texas at Austin
Machine Learning, Natural Language Processing, Cognitive Science, Eye-Tracking
Utkarsh Mujumdar
School of Information, University of Texas at Austin