🤖 AI Summary
Existing evaluation methods inadequately assess whether large language models (LLMs) authentically emulate bilingual code-switching (CS) behavior: they suffer from narrow language coverage, oversimplified modeling of CS phenomena, and poor scalability. This paper introduces the first naturalistic minimal-pair CS evaluation framework covering 11 language pairs, combining human preference experiments with analysis of LLM probability distributions to enable cross-linguistically comparable quantification. We propose a novel minimal-pair CS evaluation paradigm; confirm that bilinguals significantly prefer naturally occurring CS sentences over manipulated variants; establish a positive correlation between model scale and CS modeling fidelity, with larger models more consistently assigning higher probability to the natural sentence; and demonstrate, for the first time and in line with theoretical claims, that the largest probability gaps arise on pairs where closed-class function words (e.g., articles, auxiliaries) are manipulated.
📝 Abstract
There is a lack of an evaluation methodology that estimates the extent to which large language models (LLMs) use code-switching (CS) in the same way as bilinguals. Existing methods do not have wide language coverage, fail to account for the diverse range of CS phenomena, or do not scale. We propose an intervention based on minimal pairs of CS. Each minimal pair contains one naturally occurring CS sentence and one minimally manipulated variant. We collect up to 1,000 such pairs each for 11 language pairs. Our human experiments show that, for every language pair, bilinguals consistently prefer the naturally occurring CS sentence. Meanwhile, our experiments with current LLMs show that the larger the model, the more consistently it assigns higher probability to the naturally occurring CS sentence than to the variant. In accordance with theoretical claims, the largest probability differences arise in those pairs where the manipulated material consists of closed-class words.
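To make the probability comparison concrete, here is a minimal sketch, not the authors' released code, of scoring one minimal pair with a causal LLM via Hugging Face transformers. The model name and example sentences are hypothetical placeholders chosen only for illustration.

```python
# Minimal-pair scoring sketch: sum token log-probabilities of each sentence
# under a causal LM and check which member of the pair the model "prefers".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM with the same API works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability of `sentence` under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean negative
        # log-likelihood over the (len - 1) predicted tokens; multiply
        # back to recover the summed log-probability.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

natural = "I'll see you manana at the oficina."      # hypothetical natural CS sentence
variant = "I'll see you the manana at oficina."      # hypothetical manipulated variant

# True if the model assigns higher probability to the natural sentence.
print(sentence_logprob(natural) > sentence_logprob(variant))
```

Summed log-probabilities are length-sensitive in general, but within a minimal pair the two sentences differ only minimally, so a direct comparison of the two scores is a reasonable proxy for the preference the abstract describes.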