L3Cube-IndicHeadline-ID: A Dataset for Headline Identification and Semantic Evaluation in Low-Resource Indian Languages

📅 2025-09-02
🤖 AI Summary
Low-resource Indian languages lack fine-grained semantic evaluation benchmarks, which hinders rigorous assessment of sentence embedding models. To address this, we introduce L3Cube-IndicHeadline-ID, a news headline identification dataset covering ten low-resource Indic languages plus English. Each language contains 20,000 news articles, each paired with four headline variants (the original, a semantically similar version, a lexically similar version, and an unrelated one) that test fine-grained semantic discrimination and support evaluation of the similarity models used in retrieval-augmented generation (RAG). Annotations were validated by human experts. We benchmark multilingual and monolingual Sentence Transformers by matching articles to headlines with cosine similarity. Results show that multilingual models perform consistently well, whereas monolingual models vary in effectiveness. The dataset is publicly released and can also be repurposed for multiple-choice question answering, headline classification, and evaluation of semantic understanding in large language models.

📝 Abstract
Semantic evaluation in low-resource languages remains a major challenge in NLP. While sentence transformers have shown strong performance in high-resource settings, their effectiveness in Indic languages is underexplored due to a lack of high-quality benchmarks. To bridge this gap, we introduce L3Cube-IndicHeadline-ID, a curated headline identification dataset spanning ten low-resource Indic languages (Marathi, Hindi, Tamil, Gujarati, Odia, Kannada, Malayalam, Punjabi, Telugu, and Bengali) as well as English. Each language includes 20,000 news articles paired with four headline variants: the original, a semantically similar version, a lexically similar version, and an unrelated one, designed to test fine-grained semantic understanding. The task requires selecting the correct headline from the options using article-headline similarity. We benchmark several sentence transformers, including multilingual and language-specific models, using cosine similarity. Results show that multilingual models consistently perform well, while language-specific models vary in effectiveness. Given the rising use of similarity models in Retrieval-Augmented Generation (RAG) pipelines, this dataset also serves as a valuable resource for evaluating and improving semantic understanding in such applications. Additionally, the dataset can be repurposed for multiple-choice question answering, headline classification, or other task-specific evaluations of LLMs, making it a versatile benchmark for Indic NLP. The dataset is shared publicly at https://github.com/l3cube-pune/indic-nlp
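The selection step described in the abstract (choose the headline whose embedding is closest to the article's embedding) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `cosine_similarity` and `pick_headline` are hypothetical, and the three-dimensional vectors are made-up stand-ins for embeddings that a sentence transformer would normally produce.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_headline(article_vec: Sequence[float],
                  headline_vecs: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate headline most similar to the article."""
    scores = [cosine_similarity(article_vec, h) for h in headline_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings (assumed values, for illustration only).
article = [0.9, 0.1, 0.3]
candidates = [
    [0.8, 0.2, 0.3],    # original headline
    [0.7, 0.4, 0.1],    # semantically similar variant
    [0.2, 0.9, 0.1],    # lexically similar variant
    [-0.5, 0.1, 0.9],   # unrelated headline
]
predicted = pick_headline(article, candidates)
```

With these toy vectors the original headline (index 0) scores highest. In a real run, the article and the four headline variants would each be encoded by the model under test before this comparison.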
Problem

Research questions and friction points this paper is trying to address.

Addressing semantic evaluation challenges in low-resource Indic languages
Benchmarking sentence transformers for headline identification tasks
Providing a dataset for fine-grained semantic understanding evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created a headline identification dataset for ten Indic languages and English
Benchmarked multilingual and language-specific sentence transformers using cosine similarity
Designed for RAG pipelines and semantic evaluation
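Comparing models across languages requires aggregating per-article predictions into a score for each language. The sketch below assumes plain accuracy as the metric and a hypothetical record layout of `(language, predicted_index, gold_index)`; neither is confirmed by this summary, and the sample records are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_language(
    records: Iterable[Tuple[str, int, int]]
) -> Dict[str, float]:
    """Fraction of articles per language whose predicted headline index is correct."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for language, predicted_index, gold_index in records:
        total[language] += 1
        if predicted_index == gold_index:
            correct[language] += 1
    return {language: correct[language] / total[language] for language in total}


# Invented sample predictions, not results from the paper.
sample = [
    ("marathi", 0, 0),
    ("marathi", 2, 0),
    ("hindi", 1, 1),
    ("hindi", 3, 3),
]
scores = accuracy_by_language(sample)
```

Reporting one score per language, rather than a single pooled number, makes it possible to see where a language-specific model helps and where a multilingual model is sufficient.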
Nishant Tanksale
Department of Information Technology, PICT, Pune and L3Cube Labs, Pune
Tanmay Kokate
Department of Information Technology, PICT, Pune and L3Cube Labs, Pune
Darshan Gohad
Department of Information Technology, PICT, Pune and L3Cube Labs, Pune
Sarvadnyaa Barate
Department of Information Technology, PICT, Pune and L3Cube Labs, Pune
Raviraj Joshi
Indian Institute of Technology Madras
computer science, machine learning, natural language processing