Modeling Topics and Sociolinguistic Variation in Code-Switched Discourse: Insights from Spanish-English and Spanish-Guaraní

📅 2025-12-03
🤖 AI Summary
Problem: Manual sociolinguistic annotation of code-switching (CS) in low-resource bilingual contexts is time-intensive and lacks generalizability. Method: This study applies large language models (LLMs) to the automatic sociolinguistic annotation of Spanish–English and Spanish–Guaraní CS discourse. Its LLM-assisted pipeline labels topic, genre, and discourse-pragmatic function, and integrates corpus metadata on speaker-level variables (gender, language dominance). Contribution/Results: Applied to 3,691 code-switched sentences, the approach reveals systematic links among gender, language dominance, and discourse function in the Miami data, and a diglossic division in Paraguayan texts between formal Guaraní and informal Spanish. By reducing the manual annotation bottleneck, the method makes cross-lingual, low-resource CS research more efficient and scalable while keeping the recovered patterns interpretable.

📝 Abstract
This study presents an LLM-assisted annotation pipeline for the sociolinguistic and topical analysis of bilingual discourse in two typologically distinct contexts: Spanish-English and Spanish-Guaraní. Using large language models, we automatically labeled topic, genre, and discourse-pragmatic functions across a total of 3,691 code-switched sentences, integrated demographic metadata from the Miami Bilingual Corpus, and enriched the Spanish-Guaraní dataset with new topic annotations. The resulting distributions reveal systematic links between gender, language dominance, and discourse function in the Miami data, and a clear diglossic division between formal Guaraní and informal Spanish in Paraguayan texts. These findings replicate and extend earlier interactional and sociolinguistic observations with corpus-scale quantitative evidence. The study demonstrates that large language models can reliably recover interpretable sociolinguistic patterns traditionally accessible only through manual annotation, advancing computational methods for cross-linguistic and low-resource bilingual research.
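
The labeling step described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the field names, the prompt wording, and the `fake_llm` stand-in are all assumptions, and the paper's actual label inventories and model are not given in this summary.

```python
import json
from typing import Callable, Dict

# Hypothetical field names; the paper's real label sets are not listed here.
FIELDS = ("topic", "genre", "function")


def build_prompt(sentence: str) -> str:
    """Ask the model for one JSON object holding the three annotation fields."""
    return (
        "Annotate the following code-switched sentence. "
        "Reply with a JSON object with keys: " + ", ".join(FIELDS) + ".\n"
        "Sentence: " + sentence
    )


def annotate(sentence: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Call the model and keep only the expected fields.

    Unparseable replies and missing fields fall back to "unknown".
    """
    raw = llm(build_prompt(sentence))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    return {field: str(parsed.get(field, "unknown")) for field in FIELDS}


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call, used only to make the sketch runnable.
    return '{"topic": "family", "genre": "conversation", "function": "quotation"}'


labels = annotate("Y entonces she said que no.", fake_llm)
print(labels)
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client; a real run would replace `fake_llm` with a function that sends the prompt to the chosen model and returns its text reply.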
Problem

Research questions and friction points this paper is trying to address.

Automatically annotates code-switched discourse for sociolinguistic patterns
Analyzes links between demographics and discourse functions in bilingual data
Extends computational methods to low-resource bilingual contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-assisted annotation pipeline for bilingual discourse analysis
Automated labeling of topic, genre, and discourse-pragmatic functions
Integrating demographic metadata and enriching low-resource datasets
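
As a rough illustration of how demographic metadata might be combined with the automatic labels, the sketch below joins annotations to speaker records and counts discourse functions per group. The speaker IDs, metadata fields, and label values are invented for the example and are not taken from the Miami Bilingual Corpus or the paper.

```python
from collections import Counter

# Hypothetical speaker metadata and annotations, for illustration only.
speakers = {
    "S1": {"gender": "F", "dominance": "Spanish"},
    "S2": {"gender": "M", "dominance": "English"},
}
annotations = [
    {"speaker": "S1", "function": "quotation"},
    {"speaker": "S1", "function": "emphasis"},
    {"speaker": "S2", "function": "quotation"},
    {"speaker": "S1", "function": "quotation"},
]


def crosstab(annotations, speakers, variable):
    """Count (group, function) pairs, skipping rows with no speaker metadata."""
    counts = Counter()
    for row in annotations:
        meta = speakers.get(row["speaker"])
        if meta is None:
            continue
        counts[(meta[variable], row["function"])] += 1
    return counts


by_gender = crosstab(annotations, speakers, "gender")
for (group, function), n in sorted(by_gender.items()):
    print(group, function, n)
```

The same function can be called with `"dominance"` as the variable to tabulate discourse functions by language dominance instead of gender.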