🤖 AI Summary
This work addresses the notable gap in instruction-following evaluation benchmarks, which have predominantly focused on English and largely neglected Indic languages. We present the first structured, automatically verifiable instruction-following benchmark covering 14 Indic languages, integrating localized content and cross-lingual tasks. Data quality is ensured through human-verified translations, synthetic instruction generation grounded in native-language corpora, and a rule-based automatic validation mechanism. Evaluation results reveal that while current large language models perform reasonably well on format-constrained tasks, they exhibit significant deficiencies in lexical accuracy and cross-lingual transfer, resulting in substantially lower instruction-following capabilities across Indic languages compared to English.
📝 Abstract
Instruction-following benchmarks remain predominantly English-centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers. We introduce IndicIFEval, a benchmark that evaluates the constrained generation abilities of LLMs across 14 Indic languages using automatically verifiable, rule-based instructions. It comprises around 800 human-verified examples per language, spread across two complementary subsets: one of prompts translated from IFEval (Zhou et al., 2023) and carefully localized for Indic contexts, and IndicIFEval-Ground, consisting of synthetically generated instructions grounded in native Indic content. We conduct a comprehensive evaluation of major open-weight and proprietary models, spanning both reasoning and non-reasoning variants. While models maintain strong adherence to formatting constraints, they struggle significantly with lexical and cross-lingual tasks; despite progress in high-resource languages, instruction following across the broader Indic family lags significantly behind English. We release IndicIFEval and its evaluation scripts to support progress on multilingual constrained generation (http://github.com/ai4bharat/IndicIFEval).
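To illustrate what "automatically verifiable, rule-based instructions" means in IFEval-style benchmarks, here is a minimal sketch of some verifier functions. These are hypothetical, illustrative checks, not the benchmark's actual implementation; function names and constraints are assumptions for the example.

```python
# Hypothetical rule-based verifiers in the spirit of IFEval-style checks.
# Each constraint on a model response can be validated programmatically,
# so no human or LLM judge is needed at evaluation time.

def check_min_word_count(response: str, min_words: int) -> bool:
    """Formatting constraint: response must contain at least min_words words."""
    return len(response.split()) >= min_words

def check_keyword_present(response: str, keyword: str) -> bool:
    """Lexical constraint: response must contain a required keyword."""
    return keyword.lower() in response.lower()

def check_bullet_count(response: str, n: int) -> bool:
    """Formatting constraint: response must contain exactly n bullet lines."""
    bullets = [ln for ln in response.splitlines() if ln.strip().startswith("*")]
    return len(bullets) == n

# Example: a response in Devanagari script with three bullet points.
response = "* नमस्ते\n* धन्यवाद\n* स्वागत"
print(check_bullet_count(response, 3))          # → True
print(check_keyword_present(response, "नमस्ते"))  # → True
```

Such deterministic checks are what make large-scale, reproducible instruction-following evaluation feasible across many languages.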