Comparative Study of Pre-Trained BERT and Large Language Models for Code-Mixed Named Entity Recognition

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenges of informal orthography, transliteration ambiguity, and frequent code-switching in Hindi-English (Hinglish) named entity recognition (NER). Methodologically, it systematically evaluates domain-adapted pre-trained models—including HingBERT, HingRoBERTa (specifically pre-trained on Hinglish), IndicBERT, MuRIL, and the generative large language model Gemini—under both supervised fine-tuning and zero-shot settings. The key contributions are: (1) the first empirical demonstration that domain-specific pre-training yields substantial gains for code-mixed NER—HingRoBERTa and HingBERT significantly outperform general multilingual models and the closed-source Gemini under supervised fine-tuning; and (2) the finding that Gemini exhibits strong zero-shot generalization capability. Results substantiate the superiority of domain specialization over scale-driven approaches, establishing a reusable methodological paradigm for low-resource code-mixed NER.

📝 Abstract
Named Entity Recognition (NER) in code-mixed text, particularly Hindi-English (Hinglish), presents unique challenges due to informal structure, transliteration, and frequent language switching. This study conducts a comparative evaluation of code-mixed fine-tuned models and non-code-mixed multilingual models, along with zero-shot generative large language models (LLMs). Specifically, we evaluate HingBERT, HingMBERT, and HingRoBERTa (trained on code-mixed data), and BERT Base Cased, IndicBERT, RoBERTa, and MuRIL (trained on non-code-mixed multilingual data). We also assess the performance of Google Gemini in a zero-shot setting using a modified version of the dataset with NER tags removed. All models are tested on a benchmark Hinglish NER dataset using Precision, Recall, and F1-score. Results show that code-mixed models, particularly HingRoBERTa and HingBERT-based fine-tuned models, outperform others, including closed-source LLMs like Google Gemini, due to domain-specific pretraining. Non-code-mixed models perform reasonably but show limited adaptability. Notably, Google Gemini exhibits competitive zero-shot performance, underlining the generalization strength of modern LLMs. This study provides key insights into the effectiveness of specialized versus generalized models for code-mixed NER tasks.
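The abstract states that all models are scored with Precision, Recall, and F1. As a rough illustration only (not the paper's actual evaluation code, and with an assumed BIO tag set), entity-level scoring over exact span matches can be sketched as:

```python
def extract_spans(tags):
    """Collect (start, end, type) entity spans from one BIO tag sequence.
    Stray I- tags without a matching open B- span are treated as O (a common
    convention; the paper may handle them differently)."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((start, i, etype))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and etype == tag[2:]:
            continue  # entity continues
        else:
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:
        spans.append((start, len(tags), etype))
    return set(spans)


def prf1(gold_seqs, pred_seqs):
    """Micro-averaged precision, recall, and F1 over exact entity-span matches."""
    tp = fp = fn = 0
    for g_tags, p_tags in zip(gold_seqs, pred_seqs):
        g, p = extract_spans(g_tags), extract_spans(p_tags)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

In practice a library such as seqeval is typically used for this; the sketch just makes explicit that scoring is at the entity level, not per token.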
Problem

Research questions and friction points this paper is trying to address.

Evaluating code-mixed versus multilingual models for Hinglish NER
Assessing zero-shot LLM performance on code-mixed named entity recognition
Comparing specialized and generalized models for language-switching text processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned code-mixed BERT models for Hinglish NER
Evaluated zero-shot Google Gemini on untagged dataset
Compared specialized versus generalized multilingual model performance
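The zero-shot evaluation feeds Gemini untagged sentences. As a minimal sketch of what such a setup might look like (the prompt wording and entity label set here are assumptions, not the paper's exact protocol, and the API call itself is omitted):

```python
# Assumed label set for illustration; the benchmark's actual tag inventory may differ.
LABELS = ["PERSON", "LOCATION", "ORGANIZATION"]


def build_zero_shot_prompt(sentence: str) -> str:
    """Build an instruction prompt asking an LLM to tag a Hinglish sentence
    with no in-context examples (zero-shot)."""
    return (
        "Identify the named entities in the following Hindi-English "
        "code-mixed sentence. Use only these labels: "
        f"{', '.join(LABELS)}. "
        "Answer as 'entity -> label' pairs, one per line.\n\n"
        f"Sentence: {sentence}"
    )
```

The model's free-text response would then be parsed back into spans and scored with the same precision/recall/F1 protocol used for the fine-tuned models.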
Mayur Shirke
Dept. of Computer Engineering, Pune Institute of Computer Technology, Pune, India
Amey Shembade
Dept. of Computer Engineering, Pune Institute of Computer Technology, Pune, India
Pavan Thorat
Dept. of Computer Engineering, Pune Institute of Computer Technology, Pune, India
Madhushri Wagh
Dept. of Computer Engineering, Pune Institute of Computer Technology, Pune, India
Raviraj Joshi
Indian Institute of Technology Madras
computer science · machine learning · natural language processing