AI Summary
This work addresses code-switching in natural language generation for contact varieties such as Singaporean English (Singlish), which suffer from scarce parallel data and rapidly evolving lexicons. The authors propose a retrieval-augmented generation (RAG) framework that treats code-switching as sparse lexical substitution guided by an externally curated lexicon. This approach supports semantic fidelity, auditability, and controllability without fine-tuning a large language model. In a human evaluation with 164 Singaporean participants, RAG outputs were rated as natural and appropriate as zero-shot prompting outputs. Automatic analyses show that the two methods transform the input very differently: RAG makes a median of 1 token edit and reaches a mean cosine similarity of 0.978, whereas zero-shot prompting paraphrases extensively, with a median of 23 token edits and a mean cosine similarity of 0.926.
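The idea of lexicon-guided sparse substitution can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy lexicon entries, the function names `retrieve_candidates` and `substitute`, and the `max_edits` parameter are hypothetical, and the rule-based replacement stands in for the language model that performs the guided generation in the described framework.

```python
import re

# Hypothetical toy lexicon: Standard English word -> Singlish candidate.
# Illustrative entries only; this is not the authors' curated resource.
TOY_LEXICON = {
    "eat": "makan",
    "nosy": "kaypoh",
    "delicious": "shiok",
}

def retrieve_candidates(sentence, lexicon):
    """Return (source, candidate) pairs whose source word occurs in the sentence."""
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    return [(src, tgt) for src, tgt in lexicon.items() if src in tokens]

def substitute(sentence, candidates, max_edits=1):
    """Apply at most max_edits whole-word substitutions; all other text is untouched.

    Returns the rewritten sentence and the list of applied substitutions,
    which serves as an audit trail of exactly what changed.
    """
    applied = []
    for src, tgt in candidates[:max_edits]:
        pattern = re.compile(rf"\b{re.escape(src)}\b", flags=re.IGNORECASE)
        sentence, count = pattern.subn(tgt, sentence, count=1)
        if count:
            applied.append((src, tgt))
    return sentence, applied

source = "Let's eat now."
output, log = substitute(source, retrieve_candidates(source, TOY_LEXICON))
print(output)  # Let's makan now.
print(log)     # [('eat', 'makan')]
```

Because the lexicon lives outside the model, updating it as the vocabulary shifts only requires editing the dictionary, and the returned log records each substitution for later inspection.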
Abstract
Code-switching in contact varieties like Singaporean English (Singlish) challenges natural language generation due to limited parallel data and rapid lexical evolution. We propose a retrieval-augmented generation (RAG) framework that externalizes dialectal knowledge into a curated lexicon, enabling controlled lexical code-switching without fine-tuning. Our approach retrieves candidate Singlish expressions and guides generation through sparse lexical substitution. Human evaluation with 164 Singaporean participants found RAG and zero-shot prompting equally natural and appropriate. Automatic analyses reveal different transformation regimes: zero-shot prompting induces extensive paraphrasing (median 23 token edits), whereas RAG performs minimal substitutions (median 1 edit) with higher semantic preservation (mean cosine similarity 0.978 vs. 0.926). Our results demonstrate that externalizing code-switching into lexical resources enables control and auditability without sacrificing perceived quality, offering practical advantages for rapidly evolving contact varieties.
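To make the two automatic measures concrete, the sketch below computes a token-level edit count and a cosine similarity. It rests on assumptions: the abstract does not say how edits are counted or which embedding model produces the vectors, so this version uses Levenshtein distance over whitespace tokens and takes precomputed vectors as plain lists. The function names are hypothetical.

```python
import math

def token_edit_distance(source, output):
    """Levenshtein distance over whitespace tokens.

    Assumption: insertions, deletions, and substitutions each cost 1.
    """
    x, y = source.split(), output.split()
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, x_tok in enumerate(x, start=1):
        curr = [i]
        for j, y_tok in enumerate(y, start=1):
            cost = 0 if x_tok == y_tok else 1
            curr.append(min(prev[j] + 1,          # delete x_tok
                            curr[j - 1] + 1,      # insert y_tok
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    return dot / norm if norm else 0.0

# A single lexical substitution counts as one token edit.
print(token_edit_distance("Let's eat now.", "Let's makan now."))  # 1
```

Under this reading, a median of 1 edit corresponds to swapping a single token, while a median of 23 edits means most of the sentence was rewritten.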