One Instruction Does Not Fit All: How Well Do Embeddings Align Personas and Instructions in Low-Resource Indian Languages?

📅 2026-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of aligning user roles with instructions in low-resource Indian languages, a setting where embedding models have so far lacked a unified evaluation benchmark. We propose the first multilingual role-instruction alignment framework that decouples retrieval from generation, introducing a standardized benchmark covering 12 Indian languages and four tasks: monolingual and cross-lingual role-instruction retrieval, reverse instruction-to-role retrieval, and binary compatibility classification. Under a frozen-encoder setup with a lightweight logistic regression head, we systematically evaluate eight prominent multilingual embedding models. Experimental results show that E5-Large-Instruct achieves Recall@1 of 27.4% and 20.7% in monolingual and cross-lingual retrieval, respectively; BGE-M3 attains 32.1% in reverse retrieval; and LaBSE reaches 75.3% AUROC in classification, substantially advancing evaluation capabilities for low-resource Indian languages.
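The retrieval tasks above are scored with Recall@1: for each query (persona) embedding, the nearest candidate (instruction) embedding by cosine similarity must be the gold match. A minimal sketch of that metric over precomputed embeddings, using toy 3-d vectors as stand-ins for real model outputs (the vectors and the identity gold mapping are illustrative, not from the paper):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def recall_at_k(query_embs, cand_embs, gold, k=1):
    """Fraction of queries whose gold candidate ranks in the top k."""
    hits = 0
    for i, q in enumerate(query_embs):
        ranked = sorted(range(len(cand_embs)),
                        key=lambda j: cosine(q, cand_embs[j]),
                        reverse=True)
        if gold[i] in ranked[:k]:
            hits += 1
    return hits / len(query_embs)

# Toy example: 3 persona embeddings, 3 instruction embeddings,
# persona i should retrieve instruction i.
personas = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
instructions = [[0.9, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
gold = [0, 1, 2]
print(recall_at_k(personas, instructions, gold, k=1))  # 1.0
```

Reverse retrieval is the same computation with the roles of queries and candidates swapped, which is why a model can lead one direction (BGE-M3 on reverse) without leading the other.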

📝 Abstract
Aligning multilingual assistants with culturally grounded user preferences is essential for serving India's linguistically diverse population of over one billion speakers across multiple scripts. However, existing benchmarks either focus on a single language or conflate retrieval with generation, leaving open the question of whether current embedding models can encode persona-instruction compatibility without relying on response synthesis. We present a unified benchmark spanning 12 Indian languages and four evaluation tasks: monolingual and cross-lingual persona-to-instruction retrieval, reverse retrieval from instruction to persona, and binary compatibility classification. Eight multilingual embedding models are evaluated in a frozen-encoder setting with a thin logistic regression head for classification. E5-Large-Instruct achieves the highest Recall@1 of 27.4% on monolingual retrieval and 20.7% on cross-lingual transfer, while BGE-M3 leads reverse retrieval at 32.1% Recall@1. For classification, LaBSE attains 75.3% AUROC with strong calibration. These findings offer practical guidance for model selection in Indic multilingual retrieval and establish reproducible baselines for future work. (Code, datasets, and models are publicly available at https://github.com/aryashah2k/PI-Indic-Align.)
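The classification task pairs a frozen encoder with a thin logistic-regression head: encoder weights never update, and only the head is fit on features derived from (persona, instruction) embedding pairs. A minimal pure-Python sketch of such a head trained by gradient descent; the 2-d toy features and labels are illustrative stand-ins, not the paper's data or feature scheme:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_lr_head(features, labels, lr=0.5, epochs=500):
    """Fit logistic-regression weights and bias by full-batch
    gradient descent on precomputed (frozen) pair features."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of log loss w.r.t. the logit
            for d in range(dim):
                gw[d] += err * x[d]
            gb += err
        w = [wi - lr * gi / n for wi, gi in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict_proba(w, b, x):
    """Compatibility probability for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy features: compatible pairs score high on both dimensions
# (e.g. element-wise agreement between the two embeddings).
X = [[0.9, 0.8], [0.85, 0.9], [0.1, 0.2], [0.05, 0.1]]
y = [1, 1, 0, 0]
w, b = train_lr_head(X, y)
print(predict_proba(w, b, [0.9, 0.9]) > 0.5)    # True
print(predict_proba(w, b, [0.05, 0.05]) > 0.5)  # False
```

Keeping the encoder frozen isolates what the benchmark measures: whether compatibility is linearly recoverable from the embeddings themselves, rather than from task-specific fine-tuning.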
Problem

Research questions and friction points this paper is trying to address.

embedding alignment
personas
instructions
low-resource languages
Indian languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual embeddings
persona-instruction alignment
low-resource languages
retrieval benchmark
Indic languages