Sample-Efficient Language Model for Hinglish Conversational AI

📅 2025-04-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Sample inefficiency hinders Hinglish (Hindi–English code-mixed) dialogue modeling due to pervasive orthographic inconsistencies, lack of standardization, and scarcity of high-quality annotated data. Method: We propose a sample-efficient fine-tuning paradigm that synergistically integrates synthetic dialogue generation with insights from real data. Leveraging multilingual foundation models—Gemma-3B and Qwen2.5-7B—we incorporate data augmentation, controllable synthetic dialogue injection, and task-adaptive fine-tuning, all trained on limited high-quality Hinglish dialogue data. Contribution/Results: Our approach enables compact models (e.g., 4B parameters) to match or exceed the performance of larger models across multiple dialogue generation metrics. Empirical evaluation shows a 2.3× inference speedup and 58% reduction in GPU memory consumption. To our knowledge, this is the first work to empirically validate the efficient scalability of small-scale multilingual models for low-resource code-mixed dialogue understanding and generation.

Technology Category

Application Category

📝 Abstract
This paper presents our process for developing a sample-efficient language model for a conversational Hinglish chatbot. Hinglish, a code-mixed language that combines Hindi and English, presents a unique computational challenge due to inconsistent spelling, lack of standardization, and limited quality of conversational data. This work evaluates multiple pre-trained cross-lingual language models, including Gemma3-4B and Qwen2.5-7B, and employs fine-tuning techniques to improve performance on Hinglish conversational tasks. The proposed approach integrates synthetically generated dialogues with insights from existing Hinglish datasets to address data scarcity. Experimental results demonstrate that models with fewer parameters, when appropriately fine-tuned on high-quality code-mixed data, can achieve competitive performance for Hinglish conversation generation while maintaining computational efficiency.
Problem

Research questions and friction points this paper is trying to address.

Developing a sample-efficient Hinglish conversational AI model
Addressing data scarcity and inconsistency in Hinglish language
Optimizing performance with limited parameters for computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning pre-trained cross-lingual language models
Integrating synthetic dialogues with existing datasets
Optimizing performance with fewer parameters efficiently
🔎 Similar Papers
No similar papers found.
Sakshi Singh
Sakshi Singh
University of Minnesota
Robotics
Abhinav Prakash
Abhinav Prakash
Staff Data Scientist at Walmart Labs
Statistical ModelingMachine LearningData Science
A
Aakriti Shah
University of Southern California
C
Chaitanya Sachdeva
University of Southern California
S
Sanjana Dumpala
University of Southern California