From Standard English to Singlish: A Retrieval-Augmented Approach for Code-Switched Creole Generation in Large Language Models

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses code-switched text generation for contact varieties such as Singaporean English (Singlish), where parallel data is scarce and the lexicon evolves rapidly. The authors propose a retrieval-augmented generation (RAG) framework that externalizes dialectal knowledge into a curated lexicon and realizes code-switching as sparse lexical substitutions drawn from it. This design provides control and auditability without fine-tuning the large language model. In a human evaluation with 164 Singaporean participants, RAG outputs were rated as equally natural and appropriate as those from zero-shot prompting. Automatic metrics show that the two methods transform text differently: RAG makes a median of 1 token edit and reaches a mean cosine similarity of 0.978, whereas zero-shot prompting paraphrases extensively, with a median of 23 token edits and a mean cosine similarity of 0.926.
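The lexicon-guided substitution idea summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy `LEXICON`, the whole-word matching used for retrieval, and the helper names `retrieve_candidates`, `substitute`, and `build_prompt` are all assumptions made for illustration. The paper guides an LLM with the retrieved candidates, whereas this sketch also applies a direct substitution so that the edit log is easy to inspect.

```python
import re

# Hypothetical toy lexicon (Standard English -> Singlish expression).
# The paper's curated lexicon is not reproduced here.
LEXICON = {
    "eat": "makan",
    "delicious": "shiok",
    "embarrassed": "paiseh",
}


def retrieve_candidates(sentence, lexicon):
    """Return (english, singlish) pairs whose English key occurs as a whole word."""
    found = []
    for english, singlish in lexicon.items():
        pattern = rf"\b{re.escape(english)}\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.append((english, singlish))
    return found


def substitute(sentence, candidates, max_edits=1):
    """Apply at most max_edits substitutions and return the text plus an audit log."""
    out = sentence
    applied = []
    for english, singlish in candidates[:max_edits]:
        pattern = rf"\b{re.escape(english)}\b"
        out = re.sub(pattern, singlish, out, count=1, flags=re.IGNORECASE)
        applied.append((english, singlish))
    return out, applied


def build_prompt(sentence, candidates):
    """Assemble an illustrative prompt that passes retrieved candidates to an LLM."""
    hints = "\n".join(f"- {en} -> {sg}" for en, sg in candidates)
    return (
        "Rewrite the sentence in Singlish. Change as few words as possible and "
        "use only these substitutions:\n"
        f"{hints}\n"
        f"Sentence: {sentence}"
    )


sentence = "The food was delicious."
candidates = retrieve_candidates(sentence, LEXICON)
rewritten, log = substitute(sentence, candidates)
print(rewritten)
print(log)
```

Because every change comes from a lexicon entry and is recorded in the returned log, each output can be traced back to the entries that produced it, and updating the lexicon changes behavior without retraining a model.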

📝 Abstract

Code-switching in contact varieties like Singaporean English (Singlish) challenges natural language generation due to limited parallel data and rapid lexical evolution. We propose a retrieval-augmented generation (RAG) framework that externalizes dialectal knowledge into a curated lexicon, enabling controlled lexical code-switching without fine-tuning. Our approach retrieves candidate Singlish expressions and guides generation through sparse lexical substitution. Human evaluation with 164 Singaporean participants found RAG and zero-shot prompting equally natural and appropriate. Automatic analyses reveal different transformation regimes: zero-shot prompting induces extensive paraphrasing (median 23 token edits), whereas RAG performs minimal substitutions (median 1 edit) with higher semantic preservation (mean cosine similarity 0.978 vs. 0.926). Our results demonstrate that externalizing code-switching into lexical resources enables control and auditability without sacrificing perceived quality, offering practical advantages for rapidly evolving contact varieties.
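The two automatic measurements reported in the abstract can be sketched as follows. The abstract does not specify how token edits are counted or which embedding model produces the vectors, so this sketch assumes Levenshtein distance over whitespace-separated tokens and takes embedding vectors as plain lists of floats from any encoder. The function names are hypothetical.

```python
import math
from statistics import median


def token_edit_distance(source, target):
    """Levenshtein distance over whitespace tokens (assumed definition of 'token edits')."""
    a, b = source.split(), target.split()
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(
                prev[j] + 1,         # delete a token from the source
                curr[j - 1] + 1,     # insert a token from the target
                prev[j - 1] + cost,  # keep or replace a token
            ))
        prev = curr
    return prev[-1]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Illustrative pairs only; these are not data from the paper.
pairs = [
    ("The food was delicious", "The food was shiok"),
    ("I want to eat now", "I want to makan now"),
]
edits = [token_edit_distance(src, out) for src, out in pairs]
print(median(edits))
```

Aggregating `token_edit_distance` with a median and `cosine_similarity` with a mean mirrors how the abstract reports the two regimes: few edits with high similarity indicate targeted substitution, while many edits with lower similarity indicate paraphrasing.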

Problem

Research questions and friction points this paper is trying to address.

code-switching

Singlish

lexical evolution

parallel data scarcity

contact varieties

Innovation

Methods, ideas, or system contributions that make the work stand out.

retrieval-augmented generation

code-switching

lexical substitution