L3Cube-IndicHeadline-ID: A Dataset for Headline Identification and Semantic Evaluation in Low-Resource Indian Languages

📅 2025-09-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Low-resource Indian languages lack fine-grained semantic evaluation benchmarks, which hinders rigorous assessment of sentence embedding models. To address this, we introduce the first news headline identification dataset covering ten low-resource Indian languages plus English. Each language contains 20,000 news articles, each paired with four headline variants (the original, a semantically similar version, a lexically similar version, and an unrelated one) to enable fine-grained semantic discrimination and evaluation of the similarity models used in retrieval-augmented generation (RAG). Annotations were validated by human experts. We benchmarked multilingual and monolingual Sentence Transformers by cosine-similarity matching between article and headline. Results show that multilingual models perform consistently well, whereas language-specific models vary in effectiveness. The dataset is publicly released and supports broader applications, including multiple-choice question answering, headline classification, and semantic understanding evaluation for large language models.

📝 Abstract

Semantic evaluation in low-resource languages remains a major challenge in NLP. While sentence transformers have shown strong performance in high-resource settings, their effectiveness in Indic languages is underexplored due to a lack of high-quality benchmarks. To bridge this gap, we introduce L3Cube-IndicHeadline-ID, a curated headline identification dataset spanning ten low-resource Indic languages (Marathi, Hindi, Tamil, Gujarati, Odia, Kannada, Malayalam, Punjabi, Telugu, and Bengali) as well as English. Each language includes 20,000 news articles paired with four headline variants: the original, a semantically similar version, a lexically similar version, and an unrelated one, designed to test fine-grained semantic understanding. The task requires selecting the correct headline from the options using article-headline similarity. We benchmark several sentence transformers, including multilingual and language-specific models, using cosine similarity. Results show that multilingual models consistently perform well, while language-specific models vary in effectiveness. Given the rising use of similarity models in Retrieval-Augmented Generation (RAG) pipelines, this dataset also serves as a valuable resource for evaluating and improving semantic understanding in such applications. Additionally, the dataset can be repurposed for multiple-choice question answering, headline classification, or other task-specific evaluations of LLMs, making it a versatile benchmark for Indic NLP. The dataset is shared publicly at https://github.com/l3cube-pune/indic-nlp
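
The headline identification task described in the abstract can be illustrated with a minimal sketch: embed the article and each candidate headline, score each pair by cosine similarity, and pick the highest-scoring candidate. The code below is not the authors' evaluation script. The `embed` function is a toy bag-of-words stand-in for a sentence transformer, and the record fields (`article`, `candidates`, `label`) and the example text are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence transformer: a bag-of-words count vector.

    Assumption: a real benchmark run would replace this with a model's
    sentence embedding; this version only captures word overlap.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_headline(article, candidates):
    """Return the index of the candidate headline most similar to the article."""
    article_vec = embed(article)
    scores = [cosine_similarity(article_vec, embed(c)) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def accuracy(examples):
    """Fraction of examples where the top-scoring candidate is the original headline."""
    correct = sum(
        pick_headline(ex["article"], ex["candidates"]) == ex["label"]
        for ex in examples
    )
    return correct / len(examples)


# Hypothetical record with the four variant types named in the abstract.
examples = [
    {
        "article": (
            "The city council approved a new metro line connecting "
            "the airport to the city centre"
        ),
        "candidates": [
            "City council approved new metro line to airport",    # original
            "Officials back fresh rail link for air travellers",  # semantically similar
            "City council rejected new metro line to airport",    # lexically similar
            "Monsoon rains delay mango harvest",                  # unrelated
        ],
        "label": 0,  # index of the original headline
    },
]

print(pick_headline(examples[0]["article"], examples[0]["candidates"]))
print(accuracy(examples))
```

Because the toy embedder relies on word overlap alone, it gives the semantically similar variant a score of zero and separates the lexically similar variant from the original by only a narrow margin. This is the kind of distinction the four-variant design is meant to probe in real embedding models.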

Problem

Research questions and friction points this paper is trying to address.

Addressing semantic evaluation challenges in low-resource Indic languages

Benchmarking sentence transformers for headline identification tasks

Providing a dataset for fine-grained semantic understanding evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Created a headline identification dataset for ten Indic languages plus English

Used multilingual sentence transformers for benchmarking

Designed for RAG pipelines and semantic evaluation
